Skip to content

feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector#165

Open
gyliu513 wants to merge 3 commits into
llm-d:mainfrom
gyliu513:feat/issue-163-pipeline-spans
Open

feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector#165
gyliu513 wants to merge 3 commits into
llm-d:mainfrom
gyliu513:feat/issue-163-pipeline-spans

Conversation

@gyliu513

Copy link
Copy Markdown
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds fine-grained OpenTelemetry spans for IPP's request/response plugin pipeline and the model-selector pipeline (filters → scorers → picker), so a single sampled request shows which plugins ran, in what order, and how long each took, instead of one opaque gateway.request span.

Until now the entire request lifecycle produced exactly one span.

Which issue(s) this PR fixes:

Fixes #163

Release note (write NONE if no user-facing change):

Adds fine-grained OpenTelemetry spans for IPP's request/response plugin pipeline and the model-selector pipeline

@github-actions github-actions Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 13, 2026
@gyliu513

Copy link
Copy Markdown
Member Author

/cc @nirrozenbaum

@github-actions github-actions Bot requested a review from nirrozenbaum June 13, 2026 17:14
Comment thread pkg/handlers/request.go Outdated
defer stageSpan.End()

for _, reqPlugin := range reqPlugins {
name := reqPlugin.TypedName()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe better to name this typedName?
it reads a bit strange name.Type and name.Name.

Suggested change
name := reqPlugin.TypedName()
typedName := reqPlugin.TypedName()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, renamed to typedName here and in the response/filter/scorer/picker loops for consistency. Thanks.

Comment thread pkg/handlers/request.go Outdated
Comment on lines +148 to +154
pluginCtx, span := tracer.Start(ctx, "plugin."+name.Type,
trace.WithSpanKind(trace.SpanKindInternal),
trace.WithAttributes(
attribute.String("llm_d.plugin.extension_point", requestPluginExtensionPoint),
attribute.String("llm_d.plugin.type", name.Type),
attribute.String("llm_d.plugin.name", name.Name),
))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a tracing expert so forgive me if the question is obvious..
why do we specify type twice?
once in plugin.name.Type and another one in attribute?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, it's intentional.

The span name (plugin.type)is the human-readable label shown in trace UIs; it's meant to be low-cardinality and isn't reliably queryable across tracing backends.

The llm_d.plugin.type attribute is the structured, indexable dimension you filter and aggregate on (e.g. "p99 latency grouped by plugin type"). Per OTel conventions, the span name is for display and attributes carry the queryable data, so keeping both is by design.

Comment thread pkg/handlers/response.go Outdated
defer stageSpan.End()

for _, respPlugin := range respPlugins {
name := respPlugin.TypedName()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, typedName

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


tracer := tracing.Tracer(modelSelectorTracerScope)
for _, filter := range p.filters {
name := filter.TypedName()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +302 to +304
if result != nil && result.TargetModel != nil {
span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picker is guaranteed to select a target model. we can remove the conditional.

Suggested change
if result != nil && result.TargetModel != nil {
span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))
}
span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread pkg/modelselector/modelselector.go Outdated
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
} else if result != nil && result.TargetModel != nil {
span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName()))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fit better before the log line Model selection completed. without the conditionals.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@nirrozenbaum nirrozenbaum left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gyliu513 thanks, overall looks good.
left few minor comments.

@gyliu513 gyliu513 left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nirrozenbaum

Comment thread pkg/handlers/request.go Outdated
defer stageSpan.End()

for _, reqPlugin := range reqPlugins {
name := reqPlugin.TypedName()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, renamed to typedName here and in the response/filter/scorer/picker loops for consistency. Thanks.

Comment thread pkg/handlers/request.go Outdated
Comment on lines +148 to +154
pluginCtx, span := tracer.Start(ctx, "plugin."+name.Type,
trace.WithSpanKind(trace.SpanKindInternal),
trace.WithAttributes(
attribute.String("llm_d.plugin.extension_point", requestPluginExtensionPoint),
attribute.String("llm_d.plugin.type", name.Type),
attribute.String("llm_d.plugin.name", name.Name),
))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, it's intentional.

The span name (plugin.type)is the human-readable label shown in trace UIs; it's meant to be low-cardinality and isn't reliably queryable across tracing backends.

The llm_d.plugin.type attribute is the structured, indexable dimension you filter and aggregate on (e.g. "p99 latency grouped by plugin type"). Per OTel conventions, the span name is for display and attributes carry the queryable data, so keeping both is by design.

Comment thread pkg/handlers/response.go Outdated
defer stageSpan.End()

for _, respPlugin := range respPlugins {
name := respPlugin.TypedName()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


tracer := tracing.Tracer(modelSelectorTracerScope)
for _, filter := range p.filters {
name := filter.TypedName()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread pkg/modelselector/modelselector.go Outdated
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
} else if result != nil && result.TargetModel != nil {
span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName()))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +302 to +304
if result != nil && result.TargetModel != nil {
span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gyliu513 gyliu513 force-pushed the feat/issue-163-pipeline-spans branch from b367c50 to ea46ec9 Compare June 14, 2026 14:08
Comment thread pkg/handlers/request.go
span.End()
}

return nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that we may want to add more test coverage for these changes:

We need a test that does this:

  1. Set up a tracetest.SpanRecorder (as done in telemetry_test.go).
  2. Run runRequestPlugins with one or more fake plugins.
  3. Assert that a request_plugins stage span is created with child plugin.* spans.
  4. Assert that a failing plugin produces a span with codes.Error status.

If I'm not mistaken, I don't believe we have that currently which would leave an opportunity for regressions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a runRequestPlugins test using a tracetest.SpanRecorder (mirroring telemetry_test.go) with fake plugins that asserts the request_plugins stage span, nested plugin.* child spans, and a codes.Error status when a plugin fails.

Comment thread pkg/handlers/server.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we now use the new tracing.Tracer helper?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll fix it. It was merged after I created this PR.

Comment thread pkg/handlers/server.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new constant for this was added above but we're not using it here yet.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll fix it. It was merged after I created this PR.

before := time.Now()
result := p.picker.Pick(ctx, cycleState, scoredModels)
metrics.RecordPluginProcessingLatency(pickerExtensionPoint, p.picker.TypedName().Type, p.picker.TypedName().Name, time.Since(before))
result := p.picker.Pick(spanCtx, cycleState, scoredModels)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that this call may be able to return a nil value for TargetModel?

If it does we will panic on GetName() below.

@gyliu513 gyliu513 Jun 15, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shaneutt , @nirrozenbaum post a comment here and I agree that the Picker is guaranteed to select a target model, do we still need the guard?

debugLogger.Info("Completed running picker plugin", "plugin", typedName, "result", result)
}

return result

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're in a situation with this one, as before, where test coverage needs expansion.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will add a test case

return nil, err
}

span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName()))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above: no nil guards

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is already safe — lines 91-94 guard immediately


// instrumentationName is the default OTel instrumentation scope used when no
// explicit scope is supplied to Tracer.
const instrumentationName = "llm-d-inference-payload-processor"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend shortning the name. How about:

Suggested change
const instrumentationName = "llm-d-inference-payload-processor"
const instrumentationName = "llm-d-ipp"

const instrumentationName = "llm-d-inference-payload-processor"

// Tracer returns a tracer for the given instrumentation scope, defaulting to
// "llm-d-inference-payload-processor". The build version and commit SHA are

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// "llm-d-inference-payload-processor". The build version and commit SHA are
// "llm-d-ipp". The build version and commit SHA are

Comment thread pkg/handlers/server.go

// handlersTracerScope is the OTel instrumentation scope for spans emitted by
// the request/response handlers, following the package-path naming convention.
handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shmuelk with your above comment, do we need to update here as well to

Suggested change
handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers"
handlersTracerScope = "llm-d-ipp/pkg/handlers"

/cc @nirrozenbaum @shaneutt

// modelSelectorTracerScope is the OTel instrumentation scope for spans
// emitted by the model-selector pipeline, following the package-path
// naming convention.
modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shmuelk ditto here, how about

Suggested change
modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector"
modelSelectorTracerScope = "llm-d-ipp/pkg/modelselector"

gyliu513 added 3 commits June 22, 2026 11:00
…ine and model selector

Signed-off-by: Guangya Liu <gyliu513@gmail.com>
Signed-off-by: Guangya Liu <gyliu513@gmail.com>
Signed-off-by: Guangya Liu <gyliu513@gmail.com>
@gyliu513 gyliu513 force-pushed the feat/issue-163-pipeline-spans branch from bacd262 to 5b3594e Compare June 22, 2026 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[observability] Add fine-grained OpenTelemetry spans for the plugin pipeline and model selector, aligned with llm-d-router#1483

4 participants