OpenTelemetry tracing#542
Conversation
* Integrate Bifrost LLM gateway library for unified provider access Replace individual LLM provider implementations (OpenAI, Anthropic, Azure, Bedrock, Cohere, Mistral) with the Bifrost gateway library for chat completions. Key changes: - Add new bifrost/ package implementing llm.LanguageModel interface - BifrostLLM adapter handles message conversion, tool calling, and streaming - Maps ServiceConfig to Bifrost provider configuration - Update bots/bots.go to use Bifrost for supported providers - Keep asage provider as fallback for unsupported types - Retain openai package for transcription and embeddings (not covered by Bifrost) - Retain anthropic/bedrock packages for FetchModels and evals This provides a unified API for multiple LLM providers while maintaining backwards compatibility with existing configurations. * Add Bifrost support for embeddings and transcription Extend Bifrost integration to cover additional LLM capabilities: Embeddings: - Add BifrostEmbeddingProvider implementing embeddings.EmbeddingProvider interface - Support CreateEmbedding and BatchCreateEmbeddings via Bifrost gateway - Add "bifrost" provider type to embeddings configuration - Support OpenAI, Azure, Cohere, and Bedrock providers for embeddings Transcription: - Add BifrostTranscriber for audio-to-text conversion via Bifrost - Support VTT format output compatible with existing subtitles parsing - Configurable model and provider selection This provides unified access to embedding and transcription APIs through the Bifrost gateway, allowing users to choose their preferred provider while maintaining a consistent interface. * Remove legacy provider packages in favor of Bifrost This commit completes the migration to using Bifrost as the unified LLM gateway by: - Deleting openai/, anthropic/, bedrock/, and asage/ packages - Removing ASage service type support from configuration - Updating bots.go to use Bifrost exclusively for LLM and transcription - Updating search/embeddings.go to use Bifrost for embedding providers - Fixing type assignment issues in getLLM with interface types - Using service DefaultModel for transcription (fallback to whisper-1) - Cleaning up config.go by removing OpenAI-specific helper functions All LLM providers are now accessed through the Bifrost gateway library, providing a consistent interface across OpenAI, Anthropic, Azure, OpenAI-compatible, and Bedrock providers. * Improve Bifrost configuration to use built-in features - Use NetworkConfig.BaseURL for OpenAI Compatible and other providers that support custom API endpoints (Azure uses AzureKeyConfig instead) - Pass OpenAI OrgID via NetworkConfig.ExtraHeaders - Configure built-in retry logic (2 retries with 1s-10s backoff) - Remove unused httpClient from providerAccount struct (Bifrost manages its own HTTP client internally) * Update Bifrost integration for newer library API * Responses API * Fix CI * Mod tidy * Add reasoning and annotation streaming to Bifrost responses * Restore missing functionality in Bifrost migration - Add auto-run tool resolution wrapper with recursive loop - Honor ReasoningDisabled per-request config - Route native web search through Responses API for all providers - Stream reasoning/thinking events in chat API path - Restore Anthropic thinking budget auto-calculation - Preserve thinking blocks in conversation history - Merge consecutive same-role messages for Anthropic - Fix transcription model to always use whisper-1 - Pass embedding dimensions in requests - Remove unused httpClient parameter from Bifrost * Fix review issues in Bifrost migration * /v1 issue * Fix remaining e2e failures: reasoning summaries, mock model, and embedding URL - Set reasoning.Summary = "auto" for OpenAI Responses API so reasoning summary events are emitted in the stream (matches old SDK behavior) - Add defaultModel to mock service configs (Bifrost requires it) - Pass APIURL for OpenAI embedding provider for consistency - Normalize embedding provider URLs to strip /v1 suffix * Fix some time issues * Fix e2e flake mode * Lint * Annotations fixed * Fix for Anthropic annotations --------- Co-authored-by: Claude <noreply@anthropic.com>
Bring agents-v2 up to date with master, incorporating 8 commits: token accounting, bot config hot-reload, license gating, e2e observability, model constant updates, MCP OAuth fix, docs, and RHS prompt hint removal. Resolved conflicts in bots/bots.go (imports), evals/evals.go (model constants), and accepted deletion of openai/ files superseded by the bifrost migration. Made-with: Cursor
Phase 0 of OpenTelemetry integration: clean up observability code that will be replaced by OTel spans or was never used. - Remove EnableLLMTrace config field and LanguageModelLogWrapper - Remove TraceResolved/TraceUnknown from ToolStore, simplify NewToolStore - Remove dead LLMetrics interface, llmMetrics struct, GetMetricsForAIService, IncrementLLMRequests, and unused agents_llm_requests_total counter - Update webapp config, e2e tests, and all callers Co-authored-by: christopher <christopher@mattermost.com>
Phase 1: Create telemetry/ package with OTLP gRPC exporter setup, semantic attribute constants, and TracerProvider lifecycle management. Add EnableOpenTelemetry and OpenTelemetryEndpoint config fields. Wire Init/Shutdown into plugin OnActivate/OnDeactivate. Co-authored-by: christopher <christopher@mattermost.com>
Update all remaining implementations, mocks, and callers of the LanguageModel interface to accept ctx context.Context as the first parameter to ChatCompletion and ChatCompletionNoStream. Implementations updated: - llm/auto_run_tools_test.go (testLLM, capturingLLM) - llm/token_tracking_test.go (MockLanguageModel) - llm/token_tracking_bench_test.go (benchFakeLLM) - llm/mocks/language_model_mock.go (generated mock) - mmtools/web_search_test.go (mockLanguageModel) - api/fake_llm_test.go (FakeLLM) Production callers updated: - conversations/conversations.go - conversations/tool_handling.go - channels/channels.go - threads/threads.go - search/search.go - react/react.go - meetings/meeting_summarization.go - api/api_llm_bridge.go - mmtools/web_search.go - evals/graders.go Test callers and mock expectations updated: - llm/auto_run_tools_test.go - llm/token_tracking_test.go - llm/token_tracking_bench_test.go - threads/threads_test.go - react/react_test.go Uses context.Background() for callers that don't have a context available yet. Uses stdcontext import alias where the context parameter name shadows the context package. Co-authored-by: christopher <christopher@mattermost.com>
Phase 2b-2d: Replace context.Background() placeholders with real request-scoped context from HTTP handlers and plugin hooks. Fix streaming to use passed ctx as parent instead of context.Background(). Co-authored-by: christopher <christopher@mattermost.com>
Phase 3a: Add otelgin.Middleware to Gin router for automatic HTTP span creation with method, route, and status code. Phase 3b: Instrument bifrost ChatCompletion with spans that capture provider, model, operation, token usage, and errors. Errors are recorded on spans with proper status codes. Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
Phase 4: Add comprehensive tests for telemetry Init/Shutdown, Tracer, SpanFromContext, WithLLMAttributes, span hierarchy, and context propagation. Fix schema URL mismatch between semconv versions by using resource.New instead of resource.Merge. Co-authored-by: christopher <christopher@mattermost.com>
Phase 5: Add dev/docker-compose.otel.yml with Jaeger all-in-one container for local development. Run with: docker compose -f dev/docker-compose.otel.yml up -d Then browse http://localhost:16686 for the Jaeger UI. Configure plugin with EnableOpenTelemetry=true and OpenTelemetryEndpoint=localhost:4317 to start exporting traces. Co-authored-by: christopher <christopher@mattermost.com>
|
Cursor Agent can help with this pull request. Just |
🤖 LLM Evaluation ResultsOpenAI
❌ Failed EvaluationsShow 7 failuresOPENAI1. TestReactEval/[openai]_react_cat_message
2. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json
3. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json
4. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json
5. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json
6. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json
7. TestDirectMessageConversations/[openai]_bot_dm_tool_introspection
Anthropic
❌ Failed EvaluationsShow 7 failuresANTHROPIC1. TestReactEval/[anthropic]_react_cat_message
2. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json
3. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json
4. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json
5. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json
6. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json
7. TestDirectMessageConversations/[anthropic]_bot_dm_tool_introspection
This comment was automatically generated by the eval CI pipeline. |
Add 6 integration tests verifying end-to-end span export: - LLM completion spans with attributes and token usage - Error spans with proper status codes and exception events - Tool resolution spans (success and unknown tool error) - Parent-child span hierarchy (HTTP -> LLM) - Full request trace (HTTP -> process -> LLM -> tool) Fix ResolveTool to record errors on spans for both unknown tools and resolver failures. Co-authored-by: christopher <christopher@mattermost.com>
- CLAUDE.md: Add tracing section covering telemetry package, context threading, span instrumentation patterns, config, local testing, and the stdcontext aliasing convention. - docs/admin_guide.md: Add comprehensive OpenTelemetry section with what gets traced, how to enable, local Jaeger setup, production deployment guidance, and custom attribute reference table. Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
Resolve conflicts from 24 master commits including Bifrost migration, structured output support, MCP stateless sessions, response placeholder pattern, automated invoker detection, and e2e CI sharding. Key merge decisions: - Keep new StructuredOutputFallbackWrapper with ctx parameter - Adopt master's respondToPost/createResponsePlaceholder pattern with ctx - Keep isAutomatedInvoker/computeAllowToolsInChannel from master - Update new LanguageModel implementations (structured output wrapper) to match interface with context.Context parameter - Thread ctx through new streamResponseToExistingPost and respondToPost Co-authored-by: christopher <christopher@mattermost.com>
Resolve conflicts from 5 master commits (per-tool config, dependency updates, Anthropic fix, Entry SKU, search fix). New code updated for OpenTelemetry: - Add ctx to wrapStreamWithMCPAutoApproval and pass to ResolveTool - Add ctx to AutoExecuteApprovedToolCalls -> HandleToolCall call - Add ctx to capturingLanguageModel test mock - Fix NewToolStore() calls in new tests (old 2-arg signature removed) - Fix ToolAutoRunKey usage in auto_run_tools_test with ctx param Co-authored-by: christopher <christopher@mattermost.com>
Resolve conflicts from 17 master commits including: - Go module rename from mattermost-plugin-ai to mattermost-plugin-agents - Config migration from config.json to database - Auto-run everywhere tool policy - Channel tools auto-run restriction for bot invocations - LLM proxy support - Development environment setup - Removed built-in GitHub/Jira tools - Tool failure recovery in agent loop - License header updates - Documentation additions All import paths updated to mattermost-plugin-agents. OTel tracing (ctx params, telemetry spans, telemetry imports) preserved throughout. Co-authored-by: christopher <christopher@mattermost.com>
Replace the legacy Enable LLM Trace toggle in the Debug panel with: - Enable OpenTelemetry (boolean toggle) - OpenTelemetry Endpoint (text field, disabled when OTel is off) EnableLLMTrace is fully redundant: our OTel spans capture LLM calls, tool execution, errors, and token usage in a structured, queryable way via any OTLP backend (Jaeger, Grafana Tempo, etc.), replacing the old approach of dumping full conversation data to plugin logs. Also clean up stale EnableLLMTrace references that were re-introduced during master merges in test mocks, store tests, and e2e helpers. Co-authored-by: christopher <christopher@mattermost.com>
Add context.Context to evalStreamLogger.ChatCompletion and ChatCompletionNoStream, fix NewToolStore() call signature, and add context.Background() to ChatCompletion call in tools_eval_test. Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d5b31d392
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Create analysis post with conversation ID | ||
| analysisPost := a.makeAnalysisPost(user.Locale, post.Id, data.AnalysisType, analyzeResult.ConversationID) | ||
| if err := a.streamingService.StreamToNewDM(stdcontext.Background(), botUserID, analyzeResult.Stream, user.Id, analysisPost, post.Id); err != nil { | ||
| if err := a.streamingService.StreamToNewDM(c.Request.Context(), botUserID, analyzeResult.Stream, user.Id, analysisPost, post.Id); err != nil { |
There was a problem hiding this comment.
Detach request cancellation before starting DM stream
Passing c.Request.Context() into StreamToNewDM causes the stream context to be canceled as soon as the HTTP handler returns, because StreamToNewDM starts streaming in a goroutine from a child context (GetStreamingContext uses context.WithCancel(inCtx)), and StreamToPost exits on <-ctx.Done(). In this flow the analysis response can be truncated or canceled immediately after the API call completes, so users may get partial/empty DM outputs even when generation succeeded.
Useful? React with 👍 / 👎.
| if cfg.EnableOpenTelemetry { | ||
| shutdown, telErr := telemetry.Init(context.Background(), "mattermost-ai-agents", manifest.Version, cfg.OpenTelemetryEndpoint) |
There was a problem hiding this comment.
Reinitialize telemetry when OpenTelemetry config changes
Telemetry initialization is only executed during OnActivate, so changing enableOpenTelemetry or openTelemetryEndpoint at runtime will not apply until the plugin is deactivated/reactivated. Because config updates in this plugin are normally propagated via registered update listeners, this makes the new System Console controls effectively stale after save and can leave tracing disabled (or pointed at an old endpoint) until a manual restart.
Useful? React with 👍 / 👎.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds OpenTelemetry tracing and request-context propagation across the plugin: new telemetry package and attributes, OTel init/shutdown, Gin middleware, instrumented spans for LLM/Bifrost, tool resolution, MCP, web searches and streaming; many APIs and wrappers accept context.Context; config/UI replaced LLM-trace toggle with OpenTelemetry settings; local Grafana/Tempo dev stack and telemetry tests added. ChangesTelemetry + Context Propagation (single cohesive DAG)
Sequence Diagram(s)sequenceDiagram
participant Client as HTTP Client
participant Gin as Gin API
participant Telemetry as OTel Tracer
participant Service as Plugin Service
participant ToolRunner as ToolRunner
participant LLM as Bifrost/LLM
participant Exporter as OTLP Exporter
Client->>Gin: HTTP request
Gin->>Telemetry: Start request span
Gin->>Service: Handler(ctx)
Service->>Telemetry: Start operation span
Service->>ToolRunner: Run(ctx, ...)
ToolRunner->>Telemetry: Start "resolve tool" span
ToolRunner->>LLM: ChatCompletion(ctx, req)
LLM->>Telemetry: Start "llm chat completion" span
LLM->>Telemetry: Add attributes/events (tokens, errors)
Telemetry->>Exporter: Export spans
LLM-->>Service: Stream/result
Service-->>Gin: Response
Gin-->>Client: HTTP response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
websearch/google.go (1)
93-100:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMissing error recording on span for additional failure paths.
The span only records errors for the HTTP request failure (lines 87-88), but two other error paths return errors without updating the span status:
- Non-OK HTTP status (line 94)
- JSON decode failure (line 99)
This creates observability gaps where the span appears successful despite the operation failing.
🔧 Proposed fix to record all errors on the span
if resp.StatusCode != http.StatusOK { - return nil, fmt.Errorf("web search request failed: status %s", resp.Status) + err := fmt.Errorf("web search request failed: status %s", resp.Status) + span.RecordError(err) + span.SetStatus(codes.Error, err.Error()) + return nil, err } var payload googleSearchResponse if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil { - return nil, fmt.Errorf("failed to decode web search response: %w", err) + span.RecordError(err) + span.SetStatus(codes.Error, err.Error()) + return nil, fmt.Errorf("failed to decode web search response: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@websearch/google.go` around lines 93 - 100, The span isn't updated on the non-OK HTTP status and JSON decode failure paths; update the span before returning error in both cases by recording the error (span.RecordError(...)) and setting the span status to error (span.SetStatus(codes.Error, "web search failed" or the specific message)) for the non-OK response check around resp and for the JSON decode error when decoding into payload; use the same span variable used earlier and the otel codes package (codes.Error) so traces reflect these failures.search/search.go (1)
268-291:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSpan ends before async work completes.
The span created at line 268 ends when
RunSearchreturns (viadefer span.End()at line 269), butprocessSearchcontinues executing asynchronously on line 291. This means:
- The "run search" span will be recorded as complete before the actual search work finishes.
- Any child spans created within
processSearch(e.g., fromChatCompletion) will either be orphaned or have incorrect timing relationships.Consider either:
- Starting the span inside
processSearchinstead, or- Using
trace.ContextWithSpanto create a detached context for the goroutine that doesn't rely on the parent span's lifecycle.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@search/search.go` around lines 268 - 291, The "run search" span created with telemetry.Tracer().Start in RunSearch is ended before the asynchronous work in s.processSearch completes (defer span.End()), so move span lifecycle into the goroutine: remove the span creation/defer in RunSearch and instead start and end a new span at the top of s.processSearch (use telemetry.Tracer().Start(ctx, "run search") inside processSearch), passing the original ctx into s.processSearch as before; alternatively, if you prefer to keep the parent span in RunSearch, create a detached context for the goroutine (e.g., use trace.ContextWithSpan or a new context.Background() and start a fresh span in the goroutine) so the child spans in s.processSearch are not orphaned when RunSearch returns.
🧹 Nitpick comments (8)
conversation/service_test.go (1)
7-8: 💤 Low valueRedundant import alias.
Both imports refer to the same
contextpackage. Thestdcontextalias is only needed whencontext *llm.Contextshadows the package name, which doesn't occur in this test file. Consider using just one import.Suggested fix
import ( "context" - stdcontext "context" "encoding/json"Then update Lines 99 and 103 to use
context.Contextinstead ofstdcontext.Context.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conversation/service_test.go` around lines 7 - 8, Remove the redundant alias import of the standard context package (stdcontext) and keep only the plain "context" import; then update any usages that reference stdcontext.Context (notably in the test functions where stdcontext.Context is used) to use context.Context instead so the file consistently uses the standard context package and the duplicate alias is eliminated.conversation/service.go (1)
518-539: ⚡ Quick win
GenerateTitleshould acceptcontext.Contextfor proper tracing propagation.Per the coding guidelines, all functions in the request pipeline should accept
ctx context.Contextas the first parameter. WhileGenerateTitleis called asynchronously, passing a context would enable cancellation and trace propagation from the caller. Currently,stdcontext.Background()creates an orphaned span that won't link to the parent request trace.Proposed signature change
func (s *Service) GenerateTitle( + ctx stdcontext.Context, conversationID string, lm llm.LanguageModel, userMessage string, context *llm.Context, ) error { request := "Write a short title for the following request. Include only the title and nothing else, no quotations. Request:\n" + userMessage req := llm.CompletionRequest{ Posts: []llm.Post{ {Role: llm.PostRoleUser, Message: request}, }, Context: context, Operation: llm.OperationTitleGeneration, OperationSubType: llm.SubTypeNoStream, } - title, err := lm.ChatCompletionNoStream(stdcontext.Background(), req, + title, err := lm.ChatCompletionNoStream(ctx, req,Callers can then pass a detached context with timeout if needed:
ctx, cancel := stdcontext.WithTimeout(stdcontext.Background(), 30*time.Second) defer cancel() go s.GenerateTitle(ctx, convID, lm, msg, llmCtx)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conversation/service.go` around lines 518 - 539, GenerateTitle currently creates an orphan context with stdcontext.Background(), breaking trace/cancellation propagation; change the method signature to func (s *Service) GenerateTitle(ctx context.Context, conversationID string, lm llm.LanguageModel, userMessage string, context *llm.Context) error, pass that ctx into the lm.ChatCompletionNoStream call instead of stdcontext.Background(), and update all callers to supply an appropriate context (e.g., a parent request ctx or a timed context) when invoking Service.GenerateTitle so tracing and cancellation flow correctly.mcp/client.go (1)
435-448: ⚖️ Poor tradeoffIncomplete span error recording.
The span records errors for the nil-session case but not for subsequent failures (reconnection errors at Lines 467, 472, 483; tool call errors at Lines 490, 493; and result errors at Lines 512-514, 521). This creates inconsistent trace visibility where some MCP failures appear as errors and others don't.
Consider a helper pattern
To avoid cluttering every error path, wrap the error return:
recordErr := func(err error) error { span.RecordError(err) span.SetStatus(codes.Error, err.Error()) return err }Then use
return "", recordErr(err)at each error return site.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp/client.go` around lines 435 - 448, The span started by telemetry.Tracer().Start (named via "mcp call tool") records errors only for the initial nil-session check; add a small helper closure (e.g., recordErr := func(err error) error { span.RecordError(err); span.SetStatus(codes.Error, err.Error()); return err }) immediately after creating ctx, span and use it at every subsequent error return (reconnection errors around the reconnect calls, tool call errors from the call to session.Call toolName, and result/parse errors when handling the returned result) by replacing returns like `return "", err` with `return "", recordErr(err)` so all failure paths consistently record errors on the span.telemetry/telemetry.go (1)
34-40: 💤 Low valueConsider making TLS configuration controllable.
The OTLP exporter is hardcoded to use insecure (unencrypted) gRPC connections. This is fine for local development or internal collectors, but could be a security concern for production deployments where the collector is accessed over untrusted networks.
Consider adding a configuration option to enable TLS, or at minimum document this limitation in the admin guide.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@telemetry/telemetry.go` around lines 34 - 40, The OTLP exporter is forced to use insecure gRPC via otlptracegrpc.WithInsecure(); add a configurable TLS option so production can use encrypted connections: extend the telemetry configuration (e.g., add a TLS boolean and optional cert/CA path or serverName) and, in the exporter creation logic around otlptracegrpc.New, choose between otlptracegrpc.WithInsecure() and supplying proper transport credentials (e.g., grpc/credentials created from the configured certs or system roots) based on that config; update the exporter variable creation (exporter, err := otlptracegrpc.New(...)) to build the appropriate dial options and make sure imports and error handling reflect the new branch.llm/tools.go (1)
322-335: ⚡ Quick winAdd Go doc for the changed exported ToolStore APIs.
NewToolStoreandResolveToolare exported symbols changed in this PR and still have no doc comments.As per coding guidelines "Document all public APIs".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/tools.go` around lines 322 - 335, Add missing Go doc comments for the exported APIs: NewToolStore and ResolveTool (also add docs for AddTools if it is exported) by placing a descriptive comment immediately above each function: for NewToolStore describe it returns a pointer to a ToolStore initialized with empty tools and authErrors; for ResolveTool document the purpose (resolving a tool by name), explain the parameters (ctx context.Context, name string, argsGetter ToolArgumentGetter, llmCtx *Context) and the return values (resolved string and error), and note any side effects or error conditions; ensure comments follow GoDoc style (start with the function name) so public API is documented.llm/structured_output_fallback.go (1)
24-29: ⚡ Quick winAdd Go doc for the exported wrapper methods.
ChatCompletionandChatCompletionNoStreamare part of the public wrapper surface and were touched here, but they still have no doc comments.As per coding guidelines "Document all public APIs".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/structured_output_fallback.go` around lines 24 - 29, Add Go doc comments above the exported methods of StructuredOutputFallbackWrapper: provide a short sentence for ChatCompletion describing that it performs a streaming chat completion by delegating to the wrapped LanguageModel and returns a TextStreamResult, and a short sentence for ChatCompletionNoStream describing that it performs a non-streaming chat completion by delegating to the wrapped LanguageModel and returns the full string response and error; place the comments immediately above the ChatCompletion and ChatCompletionNoStream method declarations using the exact function names.conversations/handle_messages.go (1)
85-85: ⚡ Quick winAdd GoDoc for exported
MessageHasBeenPosted.This exported method is missing a doc comment, which violates the public API documentation rule.
As per coding guidelines, "Document all public APIs".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conversations/handle_messages.go` at line 85, Add a GoDoc comment above the exported method MessageHasBeenPosted describing its purpose and behavior; start the comment with "MessageHasBeenPosted ..." and mention the receiver (*Conversations), the parameters (_ *plugin.Context, post *model.Post) and any important side effects (e.g., how posts are handled or events emitted) and error conditions if applicable so the public API is documented.toolrunner/toolrunner.go (1)
315-320: ⚡ Quick winSet OTel span error status explicitly on resolver failures.
You already add a
telemetry.ToolStatusattribute, but failed tool spans should also record the error and set span status so backends/alerts classify them as failures.Suggested patch
import ( "context" "encoding/json" "fmt" "strings" "github.com/mattermost/mattermost-plugin-agents/llm" "github.com/mattermost/mattermost-plugin-agents/telemetry" + "go.opentelemetry.io/otel/codes" "go.opentelemetry.io/otel/trace" ) @@ if resolveErr != nil { + span.RecordError(resolveErr) + span.SetStatus(codes.Error, resolveErr.Error()) span.SetAttributes(telemetry.ToolStatus.String("error")) } else { + span.SetStatus(codes.Ok, "") span.SetAttributes(telemetry.ToolStatus.String("success")) } span.End()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@toolrunner/toolrunner.go` around lines 315 - 320, The span for tool resolution currently only sets a telemetry.ToolStatus attribute but does not mark the OpenTelemetry span as an error; when resolveErr != nil, call span.RecordError(resolveErr) and set the span status to error (e.g., span.SetStatus(codes.Error, resolveErr.Error())) before ending the span so backends classify failures correctly; update the error branch in the ToolRunner resolver handling (where resolveErr, span, telemetry.ToolStatus are used) to record the error and set the span status, leaving the success branch to set success attribute and appropriate OK status if desired.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/admin_guide.md`:
- Line 356: The fenced code block showing the trace example is missing a
language identifier which triggers markdownlint MD040; update the opening fence
(the triple-backticks that precede "HTTP POST /post/:postid/react") to include a
language tag such as text (e.g., ```text) so the block becomes a highlighted
plain-text fenced code block and resolves the MD040 lint warning.
In `@e2e/tests/system-console/debug-panel.spec.ts`:
- Around line 29-33: RADIO_INDICES for enableTokenUsageLogging points at the
wrong radio controls due to an added BooleanItem (Enable OpenTelemetry) shifting
indices; update the RADIO_INDICES constant's enableTokenUsageLogging mapping
(the enableTokenUsageLogging key inside RADIO_INDICES) to use the corrected pair
of indices (increment both current indices by 2 so the true/false values point
to the actual Token Usage Logging radios), or replace the hard-coded numbers
with a small helper that computes offsets based on the preceding BooleanItem
positions.
In `@go.mod`:
- Around line 29-33: The otlptracegrpc dependency in go.mod is pinned to v1.20.0
while the other OpenTelemetry modules (go.opentelemetry.io/otel,
go.opentelemetry.io/otel/sdk, go.opentelemetry.io/otel/trace) are at v1.43.0;
update the version for
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc to v1.43.0 in
go.mod so all OpenTelemetry modules use the same release series and then run `go
mod tidy` to refresh the dependency graph.
In `@llm/language_model.go`:
- Around line 29-32: Add concise method-level comments for the exported
LanguageModel interface and its two public methods: ChatCompletion and
ChatCompletionNoStream. For ChatCompletion, document that it accepts a
context.Context for cancellation/deadlines, returns a TextStreamResult that
provides streaming tokens/updates and may yield partial results, and describe
error semantics when the stream fails or is canceled. For
ChatCompletionNoStream, document that it performs a non-streaming completion,
returns the final string result, respects context cancellation/deadlines, and
describe error behavior for failures. Ensure the comments are brief, placed
immediately above each method signature, and follow the project's public API
comment style.
In `@llm/tools.go`:
- Around line 335-353: ResolveTool currently starts a request span but calls
tool.Resolver without passing the request context, so cancellation/tracing don't
propagate; update the ToolResolver type signature to accept ctx context.Context
as the first parameter (e.g. func(ctx context.Context, llmCtx *Context, args
ToolArgumentGetter) (string, error)) and change all Resolver implementations and
the call site in ToolStore.ResolveTool to pass the incoming ctx through to
tool.Resolver; if the file imports the context package as the identifier
"context" but also defines llm.Context, import the stdlib context package with
the alias stdcontext and use that alias in the new signatures and calls.
In `@search/search_conversation_test.go`:
- Around line 415-416: The mock setup for mockLLM.On("ChatCompletionNoStream",
...) uses only two arguments but the ChatCompletionNoStream signature now takes
a context.Context plus request and options (3+ args); update the expectation in
the test to match by calling mockLLM.On("ChatCompletionNoStream", mock.Anything,
mock.Anything, mock.Anything) (or otherwise include mock.Anything for the ctx
parameter and for the variadic opts) so the mock matches the
ChatCompletionNoStream method signature used elsewhere.
---
Outside diff comments:
In `@search/search.go`:
- Around line 268-291: The "run search" span created with
telemetry.Tracer().Start in RunSearch is ended before the asynchronous work in
s.processSearch completes (defer span.End()), so move span lifecycle into the
goroutine: remove the span creation/defer in RunSearch and instead start and end
a new span at the top of s.processSearch (use telemetry.Tracer().Start(ctx, "run
search") inside processSearch), passing the original ctx into s.processSearch as
before; alternatively, if you prefer to keep the parent span in RunSearch,
create a detached context for the goroutine (e.g., use trace.ContextWithSpan or
a new context.Background() and start a fresh span in the goroutine) so the child
spans in s.processSearch are not orphaned when RunSearch returns.
In `@websearch/google.go`:
- Around line 93-100: The span isn't updated on the non-OK HTTP status and JSON
decode failure paths; update the span before returning error in both cases by
recording the error (span.RecordError(...)) and setting the span status to error
(span.SetStatus(codes.Error, "web search failed" or the specific message)) for
the non-OK response check around resp and for the JSON decode error when
decoding into payload; use the same span variable used earlier and the otel
codes package (codes.Error) so traces reflect these failures.
---
Nitpick comments:
In `@conversation/service_test.go`:
- Around line 7-8: Remove the redundant alias import of the standard context
package (stdcontext) and keep only the plain "context" import; then update any
usages that reference stdcontext.Context (notably in the test functions where
stdcontext.Context is used) to use context.Context instead so the file
consistently uses the standard context package and the duplicate alias is
eliminated.
In `@conversation/service.go`:
- Around line 518-539: GenerateTitle currently creates an orphan context with
stdcontext.Background(), breaking trace/cancellation propagation; change the
method signature to func (s *Service) GenerateTitle(ctx context.Context,
conversationID string, lm llm.LanguageModel, userMessage string, context
*llm.Context) error, pass that ctx into the lm.ChatCompletionNoStream call
instead of stdcontext.Background(), and update all callers to supply an
appropriate context (e.g., a parent request ctx or a timed context) when
invoking Service.GenerateTitle so tracing and cancellation flow correctly.
In `@conversations/handle_messages.go`:
- Line 85: Add a GoDoc comment above the exported method MessageHasBeenPosted
describing its purpose and behavior; start the comment with
"MessageHasBeenPosted ..." and mention the receiver (*Conversations), the
parameters (_ *plugin.Context, post *model.Post) and any important side effects
(e.g., how posts are handled or events emitted) and error conditions if
applicable so the public API is documented.
In `@llm/structured_output_fallback.go`:
- Around line 24-29: Add Go doc comments above the exported methods of
StructuredOutputFallbackWrapper: provide a short sentence for ChatCompletion
describing that it performs a streaming chat completion by delegating to the
wrapped LanguageModel and returns a TextStreamResult, and a short sentence for
ChatCompletionNoStream describing that it performs a non-streaming chat
completion by delegating to the wrapped LanguageModel and returns the full
string response and error; place the comments immediately above the
ChatCompletion and ChatCompletionNoStream method declarations using the exact
function names.
In `@llm/tools.go`:
- Around line 322-335: Add missing Go doc comments for the exported APIs:
NewToolStore and ResolveTool (also add docs for AddTools if it is exported) by
placing a descriptive comment immediately above each function: for NewToolStore
describe it returns a pointer to a ToolStore initialized with empty tools and
authErrors; for ResolveTool document the purpose (resolving a tool by name),
explain the parameters (ctx context.Context, name string, argsGetter
ToolArgumentGetter, llmCtx *Context) and the return values (resolved string and
error), and note any side effects or error conditions; ensure comments follow
GoDoc style (start with the function name) so public API is documented.
In `@mcp/client.go`:
- Around line 435-448: The span started by telemetry.Tracer().Start (named via
"mcp call tool") records errors only for the initial nil-session check; add a
small helper closure (e.g., recordErr := func(err error) error {
span.RecordError(err); span.SetStatus(codes.Error, err.Error()); return err })
immediately after creating ctx, span and use it at every subsequent error return
(reconnection errors around the reconnect calls, tool call errors from the call
to session.Call toolName, and result/parse errors when handling the returned
result) by replacing returns like `return "", err` with `return "",
recordErr(err)` so all failure paths consistently record errors on the span.
In `@telemetry/telemetry.go`:
- Around line 34-40: The OTLP exporter is forced to use insecure gRPC via
otlptracegrpc.WithInsecure(); add a configurable TLS option so production can
use encrypted connections: extend the telemetry configuration (e.g., add a TLS
boolean and optional cert/CA path or serverName) and, in the exporter creation
logic around otlptracegrpc.New, choose between otlptracegrpc.WithInsecure() and
supplying proper transport credentials (e.g., grpc/credentials created from the
configured certs or system roots) based on that config; update the exporter
variable creation (exporter, err := otlptracegrpc.New(...)) to build the
appropriate dial options and make sure imports and error handling reflect the
new branch.
In `@toolrunner/toolrunner.go`:
- Around line 315-320: The span for tool resolution currently only sets a
telemetry.ToolStatus attribute but does not mark the OpenTelemetry span as an
error; when resolveErr != nil, call span.RecordError(resolveErr) and set the
span status to error (e.g., span.SetStatus(codes.Error, resolveErr.Error()))
before ending the span so backends classify failures correctly; update the error
branch in the ToolRunner resolver handling (where resolveErr, span,
telemetry.ToolStatus are used) to record the error and set the span status,
leaving the success branch to set success attribute and appropriate OK status if
desired.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: be5ed64a-0947-4d96-a059-ec54dc9f6aec
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (87)
CLAUDE.mdapi/api.goapi/api_channel.goapi/api_llm_bridge.goapi/api_no_tools_test.goapi/api_post.goapi/fake_llm_test.gobifrost/bifrost.gobifrost/bifrost_test.gobifrost/logger.gobifrost/models.gobifrost/tracer.gobifrost/tracer_test.gobots/bots.gobots/bots_test.gochannels/analysis_conversation_test.gochannels/channels.gochannels/channels_eval_test.goconfig/config.goconversation/service.goconversation/service_test.goconversations/bot_channel_tool_filter_test.goconversations/conversations.goconversations/conversations_test.goconversations/direct_message_eval_test.goconversations/dm_conversation_test.goconversations/handle_messages.goconversations/handle_messages_test.goconversations/regeneration.goconversations/tool_approval.goconversations/tool_policy_test.godev/docker-compose.otel.ymldev/grafana/provisioning/datasources/tempo.yamldev/tempo/tempo.yamldocs/admin_guide.mde2e/helpers/bot-config.tse2e/helpers/plugincontainer.tse2e/helpers/real-api-container.tse2e/helpers/system-console-container.tse2e/tests/multiplayer-tool-calling/multiplayer-tool-calling.spec.tse2e/tests/system-console/debug-panel.spec.tse2e/tests/system-console/live-service-full-flow.spec.tsevals/graders.gogo.modllm/language_model.gollm/logging.gollm/mocks/language_model_mock.gollm/structured_output_fallback.gollm/structured_output_fallback_test.gollm/token_tracking.gollm/token_tracking_bench_test.gollm/token_tracking_test.gollm/tools.gollm/tools_test.gollm/truncation.gollmcontext/llm_context.gollmcontext/llm_context_test.gomcp/client.gomcpserver/eval_helpers_test.gomcpserver/tools_eval_test.gomeetings/meeting_summarization.gometrics/metrics.gometrics/noop.gommtools/web_search.gommtools/web_search_test.goprompts/standard_personality_without_locale_test.goreact/react.goreact/react_test.gosearch/search.gosearch/search_conversation_test.gosearch/search_test.goserver/main.gostore/config_test.gostreaming/streaming.gotelemetry/attributes.gotelemetry/integration_test.gotelemetry/telemetry.gotelemetry/telemetry_test.gothreads/analysis_conversation_test.gothreads/threads.gothreads/threads_test.gotoolrunner/toolrunner.gotoolrunner/toolrunner_test.gowebapp/src/components/system_console/config.tsxwebapp/src/components/system_console/plugin_config_types.tsxwebsearch/brave.gowebsearch/google.go
💤 Files with no reviewable changes (14)
- bifrost/models.go
- e2e/helpers/system-console-container.ts
- e2e/helpers/bot-config.ts
- llmcontext/llm_context_test.go
- e2e/tests/multiplayer-tool-calling/multiplayer-tool-calling.spec.ts
- store/config_test.go
- bots/bots.go
- api/api_no_tools_test.go
- e2e/tests/system-console/live-service-full-flow.spec.ts
- e2e/helpers/plugincontainer.ts
- metrics/noop.go
- e2e/helpers/real-api-container.ts
- bots/bots_test.go
- metrics/metrics.go
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
telemetry/detach_test.go (1)
44-56: ⚡ Quick winPrefer a local tracer to avoid mutating the global OTel provider.
The save/restore pattern here is correct, but using a local tracer from
tpis cleaner and keeps the test fully isolated. Since the test only needs to verify that span relationships are preserved (not that a specific named tracer is used), you can replacetelemetry.Tracer()with a tracer obtained directly from the test provider.Suggested change
- "go.opentelemetry.io/otel" sdktrace "go.opentelemetry.io/otel/sdk/trace" "go.opentelemetry.io/otel/sdk/trace/tracetest" @@ exporter := tracetest.NewInMemoryExporter() tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter)) - prev := otel.GetTracerProvider() - otel.SetTracerProvider(tp) + tracer := tp.Tracer("test") t.Cleanup(func() { _ = tp.Shutdown(context.Background()) - otel.SetTracerProvider(prev) }) - parent, parentSpan := telemetry.Tracer().Start(context.Background(), "parent") + parent, parentSpan := tracer.Start(context.Background(), "parent") @@ - _, childSpan := telemetry.Tracer().Start(detached, "child") + _, childSpan := tracer.Start(detached, "child")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@telemetry/detach_test.go` around lines 44 - 56, The test currently mutates the global OTel provider and then calls telemetry.Tracer(); instead, use the local test provider's tracer to avoid relying on global state: replace calls to telemetry.Tracer().Start(...) with tp.Tracer("test").Start(...) for creating the parent and child spans (keeping telemetry.DetachContext(parent) and the existing cleanup/shutdown logic), so the test uses the local tracer from tp and still asserts parent/child TraceID equality.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@search/search_test.go`:
- Around line 744-750: The test currently blocks indefinitely waiting on the
mockEmbedding Search hook by doing a plain receive from processSearchEntered
(created alongside processSearchProceed) inside the mock Run callback; change
the wait to use a bounded wait (for example, replace the direct
<-processSearchEntered with a select that listens for processSearchEntered and a
time.After timeout, or use require.Eventually/require.Eventuallyf to assert the
mock was entered) so the test fails fast instead of hanging; update the
corresponding waits around the mockEmbedding.On("Search", ...) Run closure and
any places that read processSearchEntered/processSearchProceed to use this
timeout-based pattern.
In `@search/search.go`:
- Around line 512-513: The fallback direct LLM call
(bot.LLM().ChatCompletionNoStream(ctx, prompt)) can still invoke tools; change
the call to pass the tools-disabled option so tools are disabled in the
non-conversation path (e.g., call ChatCompletionNoStream with
llm.WithToolsDisabled()). Update the invocation in the fallback path where
ChatCompletionNoStream is used and ensure the same option is applied as in the
conversation-backed paths so behavior no longer depends on conversationService
wiring.
- Around line 299-302: The OpenTelemetry spans started in processSearch and
SearchQuery need to record and mark errors on any failing execution paths:
locate the span variable in processSearch and in SearchQuery and on every error
return (including deferred errors or early returns) call span.RecordError(err)
and span.SetStatus(codes.Error, err.Error()) before returning; ensure the codes
symbol from the OpenTelemetry package is imported if not already, and add these
calls consistently for all error branches so failed searches are visible in
traces.
---
Nitpick comments:
In `@telemetry/detach_test.go`:
- Around line 44-56: The test currently mutates the global OTel provider and
then calls telemetry.Tracer(); instead, use the local test provider's tracer to
avoid relying on global state: replace calls to telemetry.Tracer().Start(...)
with tp.Tracer("test").Start(...) for creating the parent and child spans
(keeping telemetry.DetachContext(parent) and the existing cleanup/shutdown
logic), so the test uses the local tracer from tp and still asserts parent/child
TraceID equality.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0abb2ac8-2de1-467d-aecb-8eb17a452e7d
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (10)
api/api_channel.goapi/api_post.godocs/admin_guide.mde2e/tests/system-console/debug-panel.spec.tsgo.modsearch/search.gosearch/search_conversation_test.gosearch/search_test.gotelemetry/detach_test.gotelemetry/telemetry.go
✅ Files skipped from review due to trivial changes (2)
- docs/admin_guide.md
- e2e/tests/system-console/debug-panel.spec.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- telemetry/telemetry.go
- search/search_conversation_test.go
- api/api_channel.go
There was a problem hiding this comment.
Stale comment
Security Review: OpenTelemetry Tracing
No new medium, high, or critical security vulnerabilities identified in this PR.
What was reviewed
This PR adds OpenTelemetry distributed tracing infrastructure to the Agents plugin, threading
context.Contextthrough the LLM call chain, adding span instrumentation at major boundaries (HTTP, LLM calls, tool execution, MCP, search, web search, streaming), and replacing the legacyEnableLLMTracefeature with structured OTel exports.Security-relevant aspects examined
Span attributes contain only operational metadata: Exported attributes are limited to entity IDs (
ai.user.id,ai.channel.id,ai.post.id), provider/model identifiers, operation types, token counts, tool names, and MCP server names. No post content, LLM prompts, system instructions, tool arguments, tool results, API keys, or user-generated content is included in span data.Legacy
EnableLLMTraceremoval is a security improvement: The removedLanguageModelLogWrapperandTraceResolved/TraceUnknownmethods logged full conversation data including prompts, tool arguments, and tool results to plugin logs. The OTel replacement exports only structured metadata — this eliminates a data exposure vector.Error messages are sanitized: All error paths in the plugin's own spans (
bifrost.gostreamChat/streamResponses) pass throughSanitizeProviderError()which strips the API key before recording on spans. The Bifrost tracer adapter receives error messages from Bifrost's internal callbacks (PopulateLLMResponseAttributes,EndSpan) without explicit sanitization, but these messages originate from LLM provider API responses which do not echo API keys, and the trace data is sent to admin-controlled infrastructure.
DetachContextcorrectly handles async streaming: The newtelemetry.DetachContext()helper preserves the OTel span while breaking the cancellation chain. This replaces the previouscontext.Background()usage and is properly applied at all threeStreamToNewDMcall sites and the asyncprocessSearchgoroutine.Auth/permission model unchanged: All HTTP handlers continue to authenticate via the server-injected
Mattermost-User-Idheader. No new endpoints, data access paths, or permission checks were introduced. Thecontext.Contextplumbing is purely for trace propagation, not for carrying auth or permission state.
OpenTelemetryEndpointis admin-only configuration: The endpoint field is exposed only in the System Console Debug panel to system admins. The gRPC exporter makes outbound connections using the OTLP protocol (not arbitrary HTTP), sending only trace protobuf data. This is standard admin infrastructure configuration, not an SSRF vector.No hardcoded credentials: No API keys, tokens, secrets, or placeholder credentials in the diff.
resource.WithHost()includes hostname: The OTel resource includes the server hostname. This is standard OTel practice and the data goes to admin-configured infrastructure only.
WithInsecure()gRPC transport: Documented in the admin guide with guidance to route through an OTel Collector for TLS. Since this is an admin-only config field with an explicit help text note, this is an intentional operational tradeoff.Sent by Cursor Automation: Find vulnerabilities
There was a problem hiding this comment.
Stale comment
Security Review: OpenTelemetry Tracing
No new medium, high, or critical security vulnerabilities identified in this PR.
All CI checks pass (31/33 passed, 2 skipped/cancelled, 0 failed).
What was reviewed
This PR adds OpenTelemetry distributed tracing infrastructure to the Agents plugin, threading
context.Contextthrough the LLM call chain, adding span instrumentation at major boundaries (HTTP, LLM calls, tool execution, MCP, search, web search, streaming), and replacing the legacyEnableLLMTracefeature with structured OTel exports.Security-relevant aspects examined
Span attributes contain only operational metadata: Exported attributes are limited to entity IDs (
ai.user.id,ai.channel.id,ai.post.id), provider/model identifiers, operation types, token counts, tool names, and MCP server names. No post content, LLM prompts, system instructions, tool arguments, tool results, API keys, or user-generated content is included in span data.Legacy
EnableLLMTraceremoval is a security improvement: The removedLanguageModelLogWrapperandTraceResolved/TraceUnknownmethods logged full conversation data including prompts, tool arguments, and tool results to plugin logs. The OTel replacement exports only structured span metadata — this eliminates a data exposure vector.Error messages are sanitized: All error paths in the plugin's own spans (
bifrost.gostreamChat/streamResponses) pass throughSanitizeProviderError()which strips the API key before recording on spans. The Bifrost tracer adapter receives error messages from Bifrost's internal callbacks (PopulateLLMResponseAttributes,EndSpan) without explicit sanitization, but these messages originate from LLM provider API responses which do not echo API keys.
DetachContextcorrectly handles async streaming: Thetelemetry.DetachContext()helper preserves the OTel span while breaking the cancellation chain. This replaces the previouscontext.Background()usage and is properly applied at allStreamToNewDMcall sites and the asyncprocessSearchgoroutine.Auth/permission model unchanged: All HTTP handlers continue to authenticate via the server-injected
Mattermost-User-Idheader. No new endpoints, data access paths, or permission checks were introduced. Thecontext.Contextplumbing is purely for trace propagation, not for carrying auth or permission state.
OpenTelemetryEndpointis admin-only configuration: The endpoint field is exposed only in the System Console Debug panel to system admins. The gRPC exporter uses the OTLP protocol, sending only trace protobuf data. This is standard admin infrastructure configuration.No hardcoded credentials: No API keys, tokens, secrets, or placeholder credentials in the diff.
WithInsecure()gRPC transport: Documented in the admin guide with guidance to route through an OTel Collector for TLS. Since this is an admin-only config field with explicit help text, this is an intentional operational tradeoff, not a vulnerability.
resource.WithHost()includes hostname: Standard OTel practice; data goes to admin-configured infrastructure only.Sent by Cursor Automation: Find vulnerabilities
nickmisasi
left a comment
There was a problem hiding this comment.
Overall looking good. A couple comments and suggestions.
One question that I may have missed the answer too in code: do we have the right span ID propogation/rehydration in order to allow us to correlate sessions across nodes?
I'm thinking through an example of an invocation of an agent that calls a tool (all one node) but that tool needs acceptance, and the acceptance gets routed to a different node.
| > | ||
| <ItemList> | ||
| <BooleanItem | ||
| label={intl.formatMessage({defaultMessage: 'Enable LLM Trace'})} |
There was a problem hiding this comment.
I'm worried about removing this - not all deployments will have access to something like Tempo or Jaeger, and this might be our only debug surface to identify issues
There was a problem hiding this comment.
Added a console log output option for the tracing.
| BotName = attribute.Key("ai.bot.name") | ||
| BotID = attribute.Key("ai.bot.id") |
There was a problem hiding this comment.
nit: should probably start using agent now since we plan to move to that
| trace.WithAttributes( | ||
| telemetry.PostID.String(post.Id), | ||
| telemetry.ChannelID.String(post.ChannelId), | ||
| ), |
There was a problem hiding this comment.
If we add the root post ID here, we'll be able to correlate spans that are created throughout a thread
There was a problem hiding this comment.
Added, I think there is a lot more stuff to add. Need some follow up PRs to refine this into something really useful.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
conversation/service.go (1)
570-575:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t use
Background()for title generation calls.Line 587 severs parent trace and cancellation by starting from
stdcontext.Background()instead of the caller’s context.Targeted fix
func (s *Service) GenerateTitle( + ctx stdcontext.Context, conversationID string, lm llm.LanguageModel, userMessage string, context *llm.Context, ) error { @@ - title, err := lm.ChatCompletionNoStream(stdcontext.Background(), req, + title, err := lm.ChatCompletionNoStream(ctx, req, llm.WithMaxGeneratedTokens(25), llm.WithReasoningDisabled(), llm.WithToolsDisabled(), )As per coding guidelines, "All functions in the request pipeline must accept
ctx context.Contextas the first parameter and propagate it from entry points ... through to LLM calls and external services for OpenTelemetry tracing."Also applies to: 587-591
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conversation/service.go` around lines 570 - 575, The GenerateTitle method on type Service currently severs tracing/cancellation by using stdcontext.Background(); change its signature to accept ctx context.Context as the first parameter (func (s *Service) GenerateTitle(ctx context.Context, conversationID string, ...)), update all callers to pass their context, replace any stdcontext.Background() usages inside GenerateTitle with the incoming ctx, and ensure that ctx is propagated into all LLM/external calls (e.g., the llm.LanguageModel call and any calls that currently receive stdcontext.Background()) so OpenTelemetry tracing and cancellation flow end-to-end.
🧹 Nitpick comments (2)
conversations/tool_approval.go (1)
128-142: ⚡ Quick winPer-tool resolution spans provide good observability.
Each tool resolution gets its own span with name, ID, and status attributes. The status is correctly set before
span.End()is called.Consider also calling
span.RecordError(resolveErr)andspan.SetStatus(codes.Error, ...)when an error occurs, per the coding guidelines for consistent error recording. Currently only Bifrost spans record errors this way.Proposed enhancement for error recording
if resolveErr != nil { toolSpan.SetAttributes(telemetry.ToolStatus.String("error")) + toolSpan.RecordError(resolveErr) + toolSpan.SetStatus(codes.Error, resolveErr.Error()) } else { toolSpan.SetAttributes(telemetry.ToolStatus.String("success")) }This requires importing
"go.opentelemetry.io/otel/codes".As per coding guidelines: "Record errors with
span.RecordError(err)andspan.SetStatus(codes.Error, msg)."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conversations/tool_approval.go` around lines 128 - 142, The span for per-tool resolution (started as toolCtx, toolSpan via telemetry.Tracer().Start in the resolve tool block) currently sets only a status attribute on error; update the error branch where resolveErr != nil to call toolSpan.RecordError(resolveErr) and toolSpan.SetStatus(codes.Error, resolveErr.Error()) before toolSpan.End(), and add the necessary import for "go.opentelemetry.io/otel/codes"; leave the success branch unchanged.telemetry/turn_id.go (1)
103-121: 💤 Low valueConsider logging or bounding the retry loop for
rand.Readfailures.The infinite loops assume
rand.Readwill eventually produce valid (non-zero) IDs. While cryptographically this is near-certain, silently discarding the error fromrand.Readmeans a broken entropy source would cause an infinite loop with no diagnostic output. This is extremely unlikely in practice, so this is a minor defensive suggestion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@telemetry/turn_id.go` around lines 103 - 121, The NewIDs method on randomIDGenerator should stop silently looping on rand.Read failures: check the error returned by rand.Read when filling trace.TraceID and trace.SpanID, add a bounded retry (e.g. a maxAttempts constant) for each ID generation loop, and on repeated failures log the error and return a safe fallback (e.g. zero IDs or propagate a higher-level error depending on surrounding code) instead of looping forever; update the loops in randomIDGenerator.NewIDs to increment attempts, handle non-nil err from rand.Read, call your logger (or panic if appropriate for your app) after exhausting attempts, and then return the fallback IDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conversation/service.go`:
- Around line 152-153: The method Service.GetInitiatingUserTurn (and the other
new request-path method around lines 180-181) must accept ctx context.Context as
the first parameter and propagate it to downstream calls; update the function
signatures to func (s *Service) GetInitiatingUserTurn(ctx context.Context,
conversationID, postID string) and any sibling method similarly, and pass ctx
into s.store.GetTurnsForConversation (and other store/handler calls) as well as
update all call sites to forward their ctx so OpenTelemetry trace context is
preserved.
---
Outside diff comments:
In `@conversation/service.go`:
- Around line 570-575: The GenerateTitle method on type Service currently severs
tracing/cancellation by using stdcontext.Background(); change its signature to
accept ctx context.Context as the first parameter (func (s *Service)
GenerateTitle(ctx context.Context, conversationID string, ...)), update all
callers to pass their context, replace any stdcontext.Background() usages inside
GenerateTitle with the incoming ctx, and ensure that ctx is propagated into all
LLM/external calls (e.g., the llm.LanguageModel call and any calls that
currently receive stdcontext.Background()) so OpenTelemetry tracing and
cancellation flow end-to-end.
---
Nitpick comments:
In `@conversations/tool_approval.go`:
- Around line 128-142: The span for per-tool resolution (started as toolCtx,
toolSpan via telemetry.Tracer().Start in the resolve tool block) currently sets
only a status attribute on error; update the error branch where resolveErr !=
nil to call toolSpan.RecordError(resolveErr) and toolSpan.SetStatus(codes.Error,
resolveErr.Error()) before toolSpan.End(), and add the necessary import for
"go.opentelemetry.io/otel/codes"; leave the success branch unchanged.
In `@telemetry/turn_id.go`:
- Around line 103-121: The NewIDs method on randomIDGenerator should stop
silently looping on rand.Read failures: check the error returned by rand.Read
when filling trace.TraceID and trace.SpanID, add a bounded retry (e.g. a
maxAttempts constant) for each ID generation loop, and on repeated failures log
the error and return a safe fallback (e.g. zero IDs or propagate a higher-level
error depending on surrounding code) instead of looping forever; update the
loops in randomIDGenerator.NewIDs to increment attempts, handle non-nil err from
rand.Read, call your logger (or panic if appropriate for your app) after
exhausting attempts, and then return the fallback IDs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 01d3ed62-887f-4241-a4c0-6682377a84c9
📒 Files selected for processing (9)
conversation/service.goconversation/service_test.goconversations/conversations.goconversations/handle_messages.goconversations/regeneration.goconversations/tool_approval.gotelemetry/telemetry.gotelemetry/turn_id.gotelemetry/turn_id_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- conversation/service_test.go
There was a problem hiding this comment.
Stale comment
Security Review — OpenTelemetry Tracing
Reviewed all 87 changed files (2,754 additions / 788 deletions) for vulnerabilities matching the security checklist: channel data isolation, identity spoofing, inter-plugin bridge trust, tool/MCP boundaries, prompt injection, file access control, search/embedding scoping, secrets exposure, and standard injection/auth bypass risks.
Summary
No medium, high, or critical security findings. This PR is primarily infrastructure plumbing — adding
context.Contextas a first parameter across the request pipeline and creating OpenTelemetry spans at system boundaries. The changes do not alter authentication, authorization, data access control, or permission-checking logic.What was verified
- No sensitive data in spans: Span attributes are limited to internal IDs (
ai.user.id,ai.channel.id,ai.post.id), operational metadata (provider name, model, token counts, tool name/status), and error messages that are already sanitized viaSanitizeProviderError. Post content, message bodies, prompts, tool arguments, and tool results are not recorded in any span.- Auth/permission logic unchanged: All
userIDvalues continue to originate from the server-injected Mattermost session header viac.Request.Context(). No new endpoints or data-access paths were introduced. Thecontext.Contextadditions are purely for cancellation/tracing propagation.- Removed logging surface is a net improvement: The old
LanguageModelLogWrapper(behindEnableLLMTrace) logged full conversation prompts, tool arguments, and tool results to the plugin log. Its removal eliminates a data-exposure surface. The OTel replacement records only metadata, not content.- OTLP endpoint is admin-only: The
OpenTelemetryEndpointconfig field is behind the System Console admin interface. TheWithInsecure()gRPC option is documented in the admin guide with guidance to use a TLS-terminating collector for production. This is an acceptable admin-configured integration pattern.- No new secrets or credentials: No hardcoded keys, tokens, or credentials in the diff. API keys continue to be sanitized from error messages before recording on spans.
Sent by Cursor Automation: Find vulnerabilities
There was a problem hiding this comment.
Stale comment
Security Review — No Findings
Reviewed the full diff (91 files, ~2,750 additions) for the OpenTelemetry tracing infrastructure PR.
What was checked:
- Span attributes for sensitive data leakage (API keys, message content, file data, credentials)
- Auth/permission changes across all modified call paths
- SSRF potential from the configurable OTLP endpoint
- Data isolation impact of user/channel/post IDs in traces
- Impact of removed logging code (
LanguageModelLogWrapper,TraceResolved/TraceUnknown)- Context propagation changes (
DetachContext,WithTurnID) for cancellation or identity issuesAssessment: No medium, high, or critical security vulnerabilities identified.
- Span attributes are limited to metadata (IDs, tool names, token counts, provider/model). No message bodies, prompts, file contents, or secrets flow into traces.
- Error messages pass through
SanitizeProviderErrorbefore recording on spans, preserving API key scrubbing.- The removed
LanguageModelLogWrapperandTraceResolved/TraceUnknownpreviously logged full prompt text and tool arguments — their removal is a net security improvement.- No auth checks are added, removed, or weakened. Context threading is purely mechanical.
- The OTLP endpoint is admin-only configuration behind system-admin permission checks.
Sent by Cursor Automation: Find vulnerabilities
@nickmisasi We did not, but we do now. |
There was a problem hiding this comment.
Stale comment
Security Review — OpenTelemetry Tracing PR
Reviewed the full diff (93 files, +3133/−790) for security concerns relevant to a multi-user Mattermost plugin environment.
Summary
This PR is primarily observability infrastructure — adding
context.Contextthreading and OpenTelemetry span instrumentation. It does not introduce new data-access paths, permission bypasses, or user-controlled inputs that reach sensitive sinks. The authorization model is unchanged; the config endpoint remains admin-gated (mattermostAdminAuthorizationRequired). Theotelginmiddleware does not capture HTTP headers by default, so the injectedMattermost-User-IDheader is not leaked into spans.Positive security note: The removal of
LanguageModelLogWrapper(which previously logged full conversation prompts viafmt.Sprintf(" %v", request)) is a net improvement — it eliminates a path where user message content was written to server logs.One medium-severity finding identified below regarding unsanitized provider error messages flowing into trace data.
Sent by Cursor Automation: Find vulnerabilities
…y-tracing-138e # Conflicts: # api/api_llm_bridge.go # api/fake_llm_test.go
|
This pull request introduces a critical LLM prompt injection risk in
🔴 LLM Prompt Injection in
|
| Vulnerability | LLM Prompt Injection |
|---|---|
| Description | The user-provided 'Prompt' field from the analysis request is injected directly into the system prompt template (summarize_channel_system.tmpl). While there is an attempt to isolate it by wrapping it in quotes and adding it as an 'Additional instruction', this structure is vulnerable to common prompt injection techniques where a malicious user can escape the quoted instruction block and override the system-defined instructions (e.g., bypassing tool limitations or mandated citation formats). |
mattermost-plugin-agents/api/api_channel.go
Lines 153 to 156 in be8d3c6
Comment to provide feedback on these findings.
Report false positive: @dryrunsecurity fp [FINDING ID] [FEEDBACK]
Report low-impact: @dryrunsecurity nit [FINDING ID] [FEEDBACK]
Example: @dryrunsecurity fp drs_90eda195 This code is not user-facing
All finding details can be found in the DryRun Security Dashboard.
|
@DryRunSecurity fp drs_5632491e This is by design. We don't care about the user overriding instructions it's their server and their LLM. |
|
Finding 5632491e has been successfully dismissed and marked as False Positive. |
| h.span.SetStatus(otelcodes.Error, bErr.Error.Message) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Medium — Unsanitized provider error message recorded on Bifrost child span
bErr.Error.Message is written directly to the span status without passing through llm.SanitizeProviderError. LLM provider error responses can contain API key material (e.g., OpenAI: "Incorrect API key provided: sk-proj-abc...xyz").
This value surfaces in two ways:
- Logs mode —
logSpanProcessor.OnEndoutputsstatus.Descriptionas the"error"log field (log_processor.go:58), making it visible in server logs. - OTLP mode — the span is exported to the configured collector.
The parent-level code in bifrost.go (lines 549, 604, 1547, 1614) correctly applies llm.SanitizeProviderError(err, b.apiKey) before recording errors on its span. This inconsistency means the Bifrost child span leaks what the parent span sanitizes.
Impact: If a provider echoes back API key material in an error response, that key would appear in trace logs or OTLP export, accessible to anyone with server-log or trace-backend access.
Suggested fix: Inject the API key into the otelTracer at construction time (it's available in newBifrostClient) and sanitize bErr.Error.Message here before setting the span status.


Summary
Adds OpenTelemetry tracing infrastructure to the Agents plugin. Spans are placed at major boundaries and flow end-to-end to a backend; attributes are sparse and follow-up work is needed before traces are operationally useful (see "Known gaps" below).
What's in
telemetry/package — OTLP gRPC exporter, helpers, attribute keys. NewEnableOpenTelemetryandOpenTelemetryEndpointconfig fields wired into the plugin lifecycle. No-op when disabled.context.Contextthreading — added as the first parameter tollm.LanguageModel.ChatCompletion/ChatCompletionNoStreamand propagated through all callers.otelgin),bifrost.LLM.ChatCompletion, tool resolution and approval flows, DM/channel/thread processing, search, MCP tool calls, web search,streaming.StreamToPost.bifrost/tracer.go) — implements Bifrost'sschemas.Traceragainst our OTel tracer. Provider calls, retries, fallbacks, and Bifrost plugin hooks now nest inside the surrounding plugin span in a single trace, with Bifrost'sgen_ai.*attributes carried through.dev/docker-compose.otel.yml) — Grafana Tempo (OTLP on:4317) + Grafana on:3001with the Tempo datasource preprovisioned. Anonymous Admin, no login.Known gaps (follow-up PR)
llm chat completionis missing posts count, tools count, system-prompt size, and TTFT (the Bifrost adapter tracks TTFT internally but doesn't surface it on the span).RecordError/SetStatus(codes.Error, ...).Ticket Link
N/A
Screenshots
N/A — System Console adds two controls in the Debug panel (Enable OpenTelemetry boolean, OpenTelemetry Endpoint text field). All other changes are backend.
Release Note
Summary by CodeRabbit
New Features
Configuration
Refactor
Documentation
Dev / Tests