Reserve already_shutdown as error.type value for SDK processor processed metrics#3710
Reserve already_shutdown as error.type value for SDK processor processed metrics#3710cijothomas wants to merge 8 commits into
already_shutdown as error.type value for SDK processor processed metrics#3710Conversation
|
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days. |
already_shutdown as error.type value for SDK processor processed metrics
|
CC @dashpole |
|
It would be nice to have a PR with a prototype. Wouldn't the SDK itself also need such metric as some spans/logs may not even start being processed given the SDK was shutdown? |
open-telemetry/opentelemetry-rust#3514 Is there a concept of shutting down an SDK? We have shutdown on each providers (which bubble it through to processors and exporters..), but there is nothing called sdk.shutdown. Can you clarify what is meant by SDK shutdown? |
I meant provider shutdown. |
Got it. This part is already covered by spec. If a provider is shutdown, then new getTracer() returns |
There is another gap you helped me realize! if provider.GetTracer(...) is called after shutdown — spec says return a no-op Tracer, so spans never enter the SDK and no existing metric fires. A good fix to bring visibility into that is probably a separate, small follow-up: a counter on the provider (e.g., otel.sdk.tracer.created with error.type=already_shutdown) so users get a signal when GetTracer returns no-op post-shutdown. Its good to be followed up, I'll open a tracking issue. |
dashpole
left a comment
There was a problem hiding this comment.
Overall, this error type seems reasonable. I'm a bit concerned that our description for the error.type field in sdk telemetry will be really long if we enumerate all of the errors. But we can cross that bridge when we get there.
| A low-cardinality description of the failure reason. SDK Batching Log Record Processors MUST use `queue_full` for log records dropped due to a full queue. | ||
| examples: ["queue_full"] | ||
| A low-cardinality description of the failure reason. | ||
| SDK Batching Log Record Processors MUST use `queue_full` for log records dropped due to a full queue. |
There was a problem hiding this comment.
nit:
| SDK Batching Log Record Processors MUST use `queue_full` for log records dropped due to a full queue. | |
| note: | |
| SDK Batching Log Record Processors MUST use `queue_full` for log records dropped due to a full queue. |
It's getting long, let's keep brief short and move details to note.
There was a problem hiding this comment.
Had to move the note to the overall metric level note. Please see if this is fine!
Co-authored-by: Liudmila Molkova <neskazu@gmail.com>
|
@JonasKunz Could you review? |
Reserves
already_shutdownas anerror.typevalue onotel.sdk.processor.span.processedandotel.sdk.processor.log.processed, used when a processor drops items because it has already been shut down. Applies to any SDK processor (not just batching).Why the processor is the right place
TracerProvider.Shutdownonly requires newGetTracer(...)calls to return a no-op; it doesn't invalidate pre-existing Tracers or affect in-flight Spans (span.End()callsSpanProcessor.OnEnddirectly).SpanProcessor.Shutdownexplicitly anticipates this: "subsequent calls to OnStart/OnEnd/ForceFlush are not allowed. SDKs SHOULD ignore these calls gracefully." So the drop is observed at the processor and recorded aserror.type=already_shutdown.Note:
GetTracer(...)after shutdown returns a no-op Tracer — spans never enter the SDK, so no metric fires. Tracked as a follow-up.Related: