feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector by gyliu513 · Pull Request #165 · llm-d/llm-d-inference-payload-processor

gyliu513 · 2026-06-13T17:14:25Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds fine-grained OpenTelemetry spans for IPP's request/response plugin pipeline and the model-selector pipeline (filters → scorers → picker), so a single sampled request shows which plugins ran, in what order, and how long each took, instead of one opaque gateway.request span.

Until now the entire request lifecycle produced exactly one span.

Which issue(s) this PR fixes:

Fixes #163

Release note (write NONE if no user-facing change):

Adds fine-grained OpenTelemetry spans for IPP's request/response plugin pipeline and the model-selector pipeline

gyliu513 · 2026-06-13T17:14:35Z

/cc @nirrozenbaum

nirrozenbaum · 2026-06-14T12:13:12Z

+	defer stageSpan.End()
+
 	for _, reqPlugin := range reqPlugins {
+		name := reqPlugin.TypedName()


nit: maybe better to name this typedName?
name.Type and name.Name read a bit strangely.

Suggested change

-		name := reqPlugin.TypedName()
+		typedName := reqPlugin.TypedName()

gyliu513 · 2026-06-14T12:52:05Z

Done, renamed to typedName here and in the response/filter/scorer/picker loops for consistency. Thanks.

nirrozenbaum · 2026-06-14T12:14:42Z

+		pluginCtx, span := tracer.Start(ctx, "plugin."+name.Type,
+			trace.WithSpanKind(trace.SpanKindInternal),
+			trace.WithAttributes(
+				attribute.String("llm_d.plugin.extension_point", requestPluginExtensionPoint),
+				attribute.String("llm_d.plugin.type", name.Type),
+				attribute.String("llm_d.plugin.name", name.Name),
+			))


I'm not a tracing expert, so forgive me if the question is obvious.
Why do we specify the type twice: once in the span name ("plugin."+name.Type) and again as an attribute?

gyliu513 · 2026-06-14T12:52:56Z

Good question, it's intentional.

The span name (plugin.<type>) is the human-readable label shown in trace UIs; it's meant to be low-cardinality and isn't reliably queryable across tracing backends.

The llm_d.plugin.type attribute is the structured, indexable dimension you filter and aggregate on (e.g. "p99 latency grouped by plugin type"). Per OTel conventions, the span name is for display and attributes carry the queryable data, so keeping both is by design.

nirrozenbaum · 2026-06-14T12:17:53Z

+	defer stageSpan.End()
+
 	for _, respPlugin := range respPlugins {
+		name := respPlugin.TypedName()


ditto, typedName

nirrozenbaum · 2026-06-14T12:18:51Z


+	tracer := tracing.Tracer(modelSelectorTracerScope)
 	for _, filter := range p.filters {
+		name := filter.TypedName()


nirrozenbaum · 2026-06-14T12:23:25Z

+	if result != nil && result.TargetModel != nil {
+		span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))
+	}


The picker is guaranteed to select a target model, so we can remove the conditional.

Suggested change

-	if result != nil && result.TargetModel != nil {
-		span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))
-	}
+	span.SetAttributes(attribute.String("llm_d.picker.selected_model", result.TargetModel.GetName()))

nirrozenbaum · 2026-06-14T12:25:54Z

+			span.RecordError(err)
+			span.SetStatus(codes.Error, err.Error())
+		} else if result != nil && result.TargetModel != nil {
+			span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName()))


This would fit better right before the "Model selection completed" log line, without the conditionals.

nirrozenbaum

@gyliu513 thanks, overall looks good.
I left a few minor comments.

gyliu513

Thanks @nirrozenbaum

shaneutt · 2026-06-15T18:06:47Z

+		span.End()
 	}

 	return nil


Seems that we may want to add more test coverage for these changes:

We need a test that does this:

Set up a tracetest.SpanRecorder (as done in telemetry_test.go).

Run runRequestPlugins with one or more fake plugins.

Assert that a request_plugins stage span is created with child plugin.* spans.

Assert that a failing plugin produces a span with codes.Error status.

If I'm not mistaken, we don't have that currently, which leaves room for regressions.

I'll add a runRequestPlugins test using a tracetest.SpanRecorder (mirroring telemetry_test.go) with fake plugins that asserts the request_plugins stage span, nested plugin.* child spans, and a codes.Error status when a plugin fails.

shaneutt · 2026-06-15T18:09:01Z

Should we now use the new tracing.Tracer helper?

Yes, I'll fix it. It was merged after I created this PR.

shaneutt · 2026-06-15T18:09:12Z

A new constant for this was added above but we're not using it here yet.

Yes, I'll fix it. It was merged after I created this PR.

shaneutt · 2026-06-15T18:11:29Z

 	before := time.Now()
-	result := p.picker.Pick(ctx, cycleState, scoredModels)
-	metrics.RecordPluginProcessingLatency(pickerExtensionPoint, p.picker.TypedName().Type, p.picker.TypedName().Name, time.Since(before))
+	result := p.picker.Pick(spanCtx, cycleState, scoredModels)


It appears that this call may be able to return a nil value for TargetModel?

If it does we will panic on GetName() below.

Hi @shaneutt, @nirrozenbaum posted a comment above saying the Picker is guaranteed to select a target model, and I agree. Do we still need the guard?
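For context on the trade-off being discussed, here is a small sketch of what the guard protects against. Model, Result, and selectedModel are hypothetical stand-ins, since the picker's real types are not shown in this thread. The getter is written nil-safe in the style of protobuf-generated code, and whether the real GetName tolerates a nil receiver is an assumption. Note that a nil result would panic on the field access itself, regardless of how the getter is written.

```go
package main

import "fmt"

// Model and Result are hypothetical stand-ins for the picker's result types;
// the real definitions are not shown in this thread.
type Model struct{ name string }

// GetName is nil-safe here, in the style of protobuf-generated getters.
// Whether the real GetName tolerates a nil receiver is an assumption.
func (m *Model) GetName() string {
	if m == nil {
		return ""
	}
	return m.name
}

type Result struct{ TargetModel *Model }

// selectedModel returns the model name and whether one was selected, without
// panicking when either pointer is nil.
func selectedModel(r *Result) (string, bool) {
	if r == nil || r.TargetModel == nil {
		return "", false
	}
	return r.TargetModel.GetName(), true
}

func main() {
	fmt.Println(selectedModel(&Result{TargetModel: &Model{name: "model-a"}}))
	fmt.Println(selectedModel(&Result{}))
	fmt.Println(selectedModel(nil))
}
```

If the picker contract really does guarantee a non-nil result and target model, the guard is redundant; a unit test that pins that contract would be one way to justify removing it.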

shaneutt · 2026-06-15T18:11:37Z

+		debugLogger.Info("Completed running picker plugin", "plugin", typedName, "result", result)
 	}

 	return result


I think we're in a situation with this one, as before, where test coverage needs expansion.

Yes, will add a test case

shaneutt · 2026-06-15T18:11:53Z

 		return nil, err
 	}

+	span.SetAttributes(attribute.String("llm_d.model_selector.selected_model", result.TargetModel.GetName()))


Similar to above: no nil guards

This one is already safe: lines 91-94 guard it immediately before this line.

shmuelk · 2026-06-18T13:34:43Z


+// instrumentationName is the default OTel instrumentation scope used when no
+// explicit scope is supplied to Tracer.
+const instrumentationName = "llm-d-inference-payload-processor"


I recommend shortening the name. How about:

Suggested change

-const instrumentationName = "llm-d-inference-payload-processor"
+const instrumentationName = "llm-d-ipp"

shmuelk · 2026-06-18T13:35:10Z

+const instrumentationName = "llm-d-inference-payload-processor"
+
+// Tracer returns a tracer for the given instrumentation scope, defaulting to
+// "llm-d-inference-payload-processor". The build version and commit SHA are


Suggested change

-// "llm-d-inference-payload-processor". The build version and commit SHA are
+// "llm-d-ipp". The build version and commit SHA are

gyliu513 · 2026-06-18T16:17:07Z

+
+	// handlersTracerScope is the OTel instrumentation scope for spans emitted by
+	// the request/response handlers, following the package-path naming convention.
+	handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers"


@shmuelk given your comment above, should we update this one as well?

Suggested change

-	handlersTracerScope = "llm-d-inference-payload-processor/pkg/handlers"
+	handlersTracerScope = "llm-d-ipp/pkg/handlers"

/cc @nirrozenbaum @shaneutt

gyliu513 · 2026-06-18T16:17:48Z

+	// modelSelectorTracerScope is the OTel instrumentation scope for spans
+	// emitted by the model-selector pipeline, following the package-path
+	// naming convention.
+	modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector"


@shmuelk ditto here, how about:

Suggested change

-	modelSelectorTracerScope = "llm-d-inference-payload-processor/pkg/modelselector"
+	modelSelectorTracerScope = "llm-d-ipp/pkg/modelselector"

github-actions bot added the size/L and kind/feature labels Jun 13, 2026

github-actions Bot requested a review from nirrozenbaum June 13, 2026 17:14

nirrozenbaum reviewed Jun 14, 2026

gyliu513 commented Jun 14, 2026

gyliu513 force-pushed the feat/issue-163-pipeline-spans branch from b367c50 to ea46ec9 Compare June 14, 2026 14:08

shaneutt reviewed Jun 15, 2026

shmuelk reviewed Jun 18, 2026

gyliu513 commented Jun 18, 2026

gyliu513 added 3 commits June 22, 2026 11:00

feat(observability): add fine-grained OTel spans for the plugin pipeline and model selector

5bf24cb

Signed-off-by: Guangya Liu <gyliu513@gmail.com>

address comments from Nir

bb7aa6b

Signed-off-by: Guangya Liu <gyliu513@gmail.com>

address comments from shaneutt

5b3594e

Signed-off-by: Guangya Liu <gyliu513@gmail.com>

gyliu513 force-pushed the feat/issue-163-pipeline-spans branch from bacd262 to 5b3594e Compare June 22, 2026 15:02
