feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector#165
feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector#165gyliu513 wants to merge 3 commits into
Conversation
|
/cc @nirrozenbaum |
| defer stageSpan.End() | ||
|
|
||
| for _, reqPlugin := range reqPlugins { | ||
| name := reqPlugin.TypedName() |
There was a problem hiding this comment.
nit: maybe better to name this typedName?
it reads a bit strange name.Type and name.Name.
| name := reqPlugin.TypedName() | |
| typedName := reqPlugin.TypedName() |
There was a problem hiding this comment.
Done, renamed to typedName here and in the response/filter/scorer/picker loops for consistency. Thanks.
| pluginCtx, span := tracer.Start(ctx, "plugin."+name.Type, | ||
| trace.WithSpanKind(trace.SpanKindInternal), | ||
| trace.WithAttributes( | ||
| attribute.String("llm_d.plugin.extension_point", requestPluginExtensionPoint), | ||
| attribute.String("llm_d.plugin.type", name.Type), | ||
| attribute.String("llm_d.plugin.name", name.Name), | ||
| )) |
There was a problem hiding this comment.
I'm not a tracing expert so forgive me if the question is obvious..
why do we specify type twice?
once in plugin.name.Type and another one in attribute?
There was a problem hiding this comment.
Good question, it's intentional.
The span name (plugin.type)is the human-readable label shown in trace UIs; it's meant to be low-cardinality and isn't reliably queryable across tracing backends.
The llm_d.plugin.type attribute is the structured, indexable dimension you filter and aggregate on (e.g. "p99 latency grouped by plugin type"). Per OTel conventions, the span name is for display and attributes carry the queryable data, so keeping both is by design.
| defer stageSpan.End() | ||
|
|
||
| for _, respPlugin := range respPlugins { | ||
| name := respPlugin.TypedName() |
|
|
||
| tracer := tracing.Tracer(modelSelectorTracerScope) | ||
| for _, filter := range p.filters { | ||
| name := filter.TypedName() |
| if result != nil && result.TargetModel != nil { | ||
| span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName())) | ||
| } |
There was a problem hiding this comment.
Picker is guaranteed to select a target model. we can remove the conditional.
| if result != nil && result.TargetModel != nil { | |
| span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName())) | |
| } | |
| span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName())) |
| span.RecordError(err) | ||
| span.SetStatus(codes.Error, err.Error()) | ||
| } else if result != nil && result.TargetModel != nil { | ||
| span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName())) |
There was a problem hiding this comment.
this would fit better before the log line Model selection completed. without the conditionals.
nirrozenbaum
left a comment
There was a problem hiding this comment.
@gyliu513 thanks, overall looks good.
left few minor comments.
| defer stageSpan.End() | ||
|
|
||
| for _, reqPlugin := range reqPlugins { | ||
| name := reqPlugin.TypedName() |
There was a problem hiding this comment.
Done, renamed to typedName here and in the response/filter/scorer/picker loops for consistency. Thanks.
| pluginCtx, span := tracer.Start(ctx, "plugin."+name.Type, | ||
| trace.WithSpanKind(trace.SpanKindInternal), | ||
| trace.WithAttributes( | ||
| attribute.String("llm_d.plugin.extension_point", requestPluginExtensionPoint), | ||
| attribute.String("llm_d.plugin.type", name.Type), | ||
| attribute.String("llm_d.plugin.name", name.Name), | ||
| )) |
There was a problem hiding this comment.
Good question, it's intentional.
The span name (plugin.type)is the human-readable label shown in trace UIs; it's meant to be low-cardinality and isn't reliably queryable across tracing backends.
The llm_d.plugin.type attribute is the structured, indexable dimension you filter and aggregate on (e.g. "p99 latency grouped by plugin type"). Per OTel conventions, the span name is for display and attributes carry the queryable data, so keeping both is by design.
| defer stageSpan.End() | ||
|
|
||
| for _, respPlugin := range respPlugins { | ||
| name := respPlugin.TypedName() |
|
|
||
| tracer := tracing.Tracer(modelSelectorTracerScope) | ||
| for _, filter := range p.filters { | ||
| name := filter.TypedName() |
| span.RecordError(err) | ||
| span.SetStatus(codes.Error, err.Error()) | ||
| } else if result != nil && result.TargetModel != nil { | ||
| span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName())) |
| if result != nil && result.TargetModel != nil { | ||
| span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName())) | ||
| } |
b367c50 to
ea46ec9
Compare
| span.End() | ||
| } | ||
|
|
||
| return nil |
There was a problem hiding this comment.
Seems that we may want to add more test coverage for these changes:
We need a test that does this:
- Set up a
tracetest.SpanRecorder(as done intelemetry_test.go). - Run
runRequestPluginswith one or more fake plugins. - Assert that a
request_pluginsstage span is created with childplugin.*spans. - Assert that a failing plugin produces a span with
codes.Errorstatus.
If I'm not mistaken, I don't believe we have that currently which would leave an opportunity for regressions.
There was a problem hiding this comment.
I'll add a runRequestPlugins test using a tracetest.SpanRecorder (mirroring telemetry_test.go) with fake plugins that asserts the request_plugins stage span, nested plugin.* child spans, and a codes.Error status when a plugin fails.
There was a problem hiding this comment.
Should we now use the new tracing.Tracer helper?
There was a problem hiding this comment.
Yes, I'll fix it. It was merged after I created this PR.
There was a problem hiding this comment.
A new constant for this was added above but we're not using it here yet.
There was a problem hiding this comment.
Yes, I'll fix it. It was merged after I created this PR.
| before := time.Now() | ||
| result := p.picker.Pick(ctx, cycleState, scoredModels) | ||
| metrics.RecordPluginProcessingLatency(pickerExtensionPoint, p.picker.TypedName().Type, p.picker.TypedName().Name, time.Since(before)) | ||
| result := p.picker.Pick(spanCtx, cycleState, scoredModels) |
There was a problem hiding this comment.
It appears that this call may be able to return a nil value for TargetModel?
If it does we will panic on GetName() below.
There was a problem hiding this comment.
Hi @shaneutt , @nirrozenbaum post a comment here and I agree that the Picker is guaranteed to select a target model, do we still need the guard?
| debugLogger.Info("Completed running picker plugin", "plugin", typedName, "result", result) | ||
| } | ||
|
|
||
| return result |
There was a problem hiding this comment.
I think we're in a situation with this one, as before, where test coverage needs expansion.
There was a problem hiding this comment.
Yes, will add a test case
| return nil, err | ||
| } | ||
|
|
||
| span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName())) |
There was a problem hiding this comment.
Similar to above: no nil guards
There was a problem hiding this comment.
This one is already safe — lines 91-94 guard immediately
|
|
||
| // instrumentationName is the default OTel instrumentation scope used when no | ||
| // explicit scope is supplied to Tracer. | ||
| const instrumentationName = "llm-d-inference-payload-processor" |
There was a problem hiding this comment.
I recommend shortning the name. How about:
| const instrumentationName = "llm-d-inference-payload-processor" | |
| const instrumentationName = "llm-d-ipp" |
| const instrumentationName = "llm-d-inference-payload-processor" | ||
|
|
||
| // Tracer returns a tracer for the given instrumentation scope, defaulting to | ||
| // "llm-d-inference-payload-processor". The build version and commit SHA are |
There was a problem hiding this comment.
| // "llm-d-inference-payload-processor". The build version and commit SHA are | |
| // "llm-d-ipp". The build version and commit SHA are |
|
|
||
| // handlersTracerScope is the OTel instrumentation scope for spans emitted by | ||
| // the request/response handlers, following the package-path naming convention. | ||
| handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers" |
There was a problem hiding this comment.
@shmuelk with your above comment, do we need to update here as well to
| handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers" | |
| handlersTracerScope = "llm-d-ipp/pkg/handlers" |
| // modelSelectorTracerScope is the OTel instrumentation scope for spans | ||
| // emitted by the model-selector pipeline, following the package-path | ||
| // naming convention. | ||
| modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector" |
There was a problem hiding this comment.
@shmuelk ditto here, how about
| modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector" | |
| modelSelectorTracerScope = "llm-d-ipp/pkg/modelselector" |
…ine and model selector Signed-off-by: Guangya Liu <gyliu513@gmail.com>
Signed-off-by: Guangya Liu <gyliu513@gmail.com>
Signed-off-by: Guangya Liu <gyliu513@gmail.com>
bacd262 to
5b3594e
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds fine-grained OpenTelemetry spans for IPP's request/response plugin pipeline and the model-selector pipeline (filters → scorers → picker), so a single sampled request shows which plugins ran, in what order, and how long each took, instead of one opaque
gateway.requestspan.Until now the entire request lifecycle produced exactly one span.
Which issue(s) this PR fixes:
Fixes #163
Release note (write
NONEif no user-facing change):