feat(sdk)!: provider builder build() returns Result instead of panicking#3462
feat(sdk)!: provider builder build() returns Result instead of panicking#3462lazureykis wants to merge 1 commit into
Conversation
657f1a4 to
5832e63
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3462 +/- ##
======================================
Coverage 83.2% 83.3%
======================================
Files 128 128
Lines 25086 25199 +113
======================================
+ Hits 20896 21013 +117
+ Misses 4190 4186 -4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
4f8a2fa to
abbee3d
Compare
TracerProviderBuilder::build() and LoggerProviderBuilder::build() now return Result<_, ProviderBuildError> instead of panicking on invalid configuration. BatchSpanProcessor::builder().build() and BatchLogProcessor::builder().build() now return Result<_, ProviderBuildError> instead of panicking when no async runtime is available. Removes with_batch_exporter() convenience methods from provider builders. Users now construct batch processors explicitly, which makes the fallible step visible in their code and avoids hidden panics. BREAKING CHANGES: - TracerProviderBuilder::build() returns Result<SdkTracerProvider, ProviderBuildError> - LoggerProviderBuilder::build() returns Result<SdkLoggerProvider, ProviderBuildError> - BatchSpanProcessor::builder().build() returns Result<BatchSpanProcessor, ProviderBuildError> - BatchLogProcessor::builder().build() returns Result<BatchLogProcessor, ProviderBuildError> - Removed TracerProviderBuilder::with_batch_exporter() - Removed LoggerProviderBuilder::with_batch_exporter() Assisted-by: Claude Opus 4.6
abbee3d to
6020391
Compare
|
Hey @lazureykis thanks for opening this PR! Anyway, to that end, it would be great if you could add some text to that tracking issue about how you came across this and the misery it caused to help make it clear this needs addressing. |
Thanks Scott! I played with the API to eliminate the panic using 3 different approaches and this PR is what I ended up with. Curious what @cijothomas thinks. |
|
https://github.com/open-telemetry/opentelemetry-rust#project-status |
|
Adding helpers to the OTLP crate would only cover OTLP users and the panic in BatchLogProcessor remains reachable through the SDK. |
You are fully correct. However, my position still remains the same - we cannot take breaking change to stable components, so OTLP Exporter builder is the only place we can address now. I think that should address majority of user feedback. Fixing BatchProcessor has to wait for v2.0 where we can make breaking changes. |
|
Thank you for your contribution! This PR has been automatically marked as stale because it has not had activity in the last 14 days. This may be due to a delay in review on our side or awaiting a response from you; either is fine, and we appreciate your patience. It will be closed in 14 days if no further activity occurs. Pushing a new commit or leaving a comment will remove the stale label and keep the PR open. |
BatchSpanProcessor and BatchLogProcessor previously panicked during shutdown via .unwrap() on thread operations. If the worker thread had panicked, handle.join().unwrap() would trigger a *second* panic that masked the original failure, and a poisoned handle mutex would panic on lock().unwrap(). shutdown_with_timeout() now joins the worker via a join_worker_thread() helper that converts a worker-thread panic into OTelSdkError::InternalFailure (preserving the original panic message), and recovers a poisoned handle mutex via into_inner() instead of unwrapping. The lock is released before the blocking join. This needs no API change, since shutdown already returns OTelSdkResult. A shared panic_message() helper is added to util.rs and covered by unit tests alongside the join_worker_thread helpers. The thread-spawn failure in the constructors remains a fail-fast panic: it is an unrecoverable setup-time error (without the worker the processor can never export), which the spec permits failing on at initialization. The panic now surfaces the underlying OS error. Returning a Result there is a breaking change tracked for the OTLP exporter builder / v2.0 via open-telemetry#2690 and open-telemetry#3462. Fixes open-telemetry#3375
Fixes #2690
Design discussion issue (if applicable) #2673
Changes
Makes
build()returnResultfor all SDK provider and batch processor builders, eliminatingtwo categories of panics at initialization time:
Thread spawn failures (
ProviderBuildError::ThreadSpawnFailed) —BatchSpanProcessorandBatchLogProcessorboth spawn a background OS thread. On resource-exhausted systems thisthread::Builder::spawn()call can fail; previously it wouldexpect()and crash the process.Async-runtime/processor mismatch (
ProviderBuildError::AsyncRuntimeRequired) — The batchprocessor runs on a dedicated OS thread using blocking calls (since 0.28). Async HTTP clients
like
reqwest::ClientandHyperClientrequire a Tokio reactor, which isn't available on thatthread — causing a panic at export time with "there is no reactor running". This is a common
pitfall because
reqwest::Client(async) is the default in many Rust applications.This PR surfaces that misconfiguration at build time.
SpanExporterandLogExportergain arequires_async_runtime() -> boolmethod (default:false); OTLP exporters propagate thisflag from their underlying
HttpClient. The check is best-effort: third-party exporters thatdo not override the method continue to return
falseand may still panic at export time.Build-time detection is preferred over simply documenting "don't use async clients" because:
reqwest::Clientis what most Rust developers reach for firstdiagnose
HttpClientimplementations can opt into the same safety net by overridingrequires_async_runtime()ProviderBuildError::AsyncRuntimeRequiredis better DX than a tokio reactor panicRemoved
with_batch_exporter()convenience method — from bothTracerProviderBuilderandLoggerProviderBuilder. Users now build the batch processor explicitly and pass it viawith_span_processor()/with_log_processor(). This eliminates the design tension of havinga fallible operation hidden inside an infallible builder method (see design rationale below).
API changes (breaking)
TracerProviderBuilder::build() -> SdkTracerProvider-> Result<SdkTracerProvider, ProviderBuildError>LoggerProviderBuilder::build() -> SdkLoggerProvider-> Result<SdkLoggerProvider, ProviderBuildError>BatchSpanProcessorBuilder::build() -> BatchSpanProcessor-> Result<BatchSpanProcessor, ProviderBuildError>BatchLogProcessorBuilder::build() -> BatchLogProcessor-> Result<BatchLogProcessor, ProviderBuildError>TracerProviderBuilder::with_batch_exporter()with_span_processor()LoggerProviderBuilder::with_batch_exporter()with_log_processor()ProviderBuildErroris#[non_exhaustive]to allow future expansion (e.g. detecting additionalinvalid client combinations as noted in #2690).
Design rationale: removing
with_batch_exporter()with_batch_exporter()was a convenience wrapper that internally constructed aBatchSpanProcessor— a fallible operation (thread spawn, async-runtime validation). Thiscreated a design problem: how should a builder method that returns
Selfreport errors?Three approaches were considered:
Deferred error (
pending_errorfield) — Store the error in the builder and surface itat
build(). This is whatreqwest::ClientBuilderdoes. The problem: hidden state, and ifwith_batch_exporter()is called twice with two invalid exporters, only the last error isreported (the first is silently dropped).
Factory closure (
PendingProcessorenum) — Store a closure that captures the exporterand defer construction to
build(). Correct and order-preserving, but novel — no major Rustcrate uses this pattern — and adds complexity (closure boxing, manual
Debugimpls).Remove the convenience method entirely — Require users to build the batch processor
explicitly via
BatchSpanProcessor::builder(exporter).build()?and pass it towith_span_processor(). This follows thetracing-subscriberpattern (.with(layer)takespre-built layers) and the broader Rust ecosystem consensus: builder setters are infallible,
build()is the single fallible call site.Option 3 was chosen because it matches the dominant pattern across the Rust ecosystem
(
tracing-subscriber,axum,tower,tokio,rdkafka— all take pre-built componentsrather than wrapping fallible construction in convenience methods). The two-line migration is
straightforward and the explicit processor construction gives users access to
BatchConfigBuilderfor customization — something
with_batch_exporter()did not expose.Provider
build()infallibility noteBoth
TracerProviderBuilder::build()andLoggerProviderBuilder::build()are currentlyinfallible at the provider level — the
Resultreturn type is prophylactic for futurevalidation. The only errors today come from batch processor construction
(
BatchSpanProcessor::builder().build()/BatchLogProcessor::builder().build()), whichhappens before the provider
build()call.SdkTracerProvider::Defaultuses.expect()which is safe because the default builderhas no batch processors.
Scope
This PR covers both
TracerProviderandLoggerProviderin a single change so the two providersstay consistent and the breaking change is delivered atomically.
Migration
Merge requirement checklist
CHANGELOG.mdfiles updated for non-trivial, user-facing changes