Skip to content

OpenTelemetry tracing#542

Merged
crspeller merged 31 commits into
masterfrom
cursor/opentelemetry-tracing-138e
May 7, 2026
Merged

OpenTelemetry tracing#542
crspeller merged 31 commits into
masterfrom
cursor/opentelemetry-tracing-138e

Conversation

@crspeller
Copy link
Copy Markdown
Member

@crspeller crspeller commented Mar 5, 2026

Summary

Adds OpenTelemetry tracing infrastructure to the Agents plugin. Spans are placed at major boundaries and flow end-to-end to a backend; attributes are sparse and follow-up work is needed before traces are operationally useful (see "Known gaps" below).

What's in
  • telemetry/ package — OTLP gRPC exporter, helpers, attribute keys. New EnableOpenTelemetry and OpenTelemetryEndpoint config fields wired into the plugin lifecycle. No-op when disabled.
  • System Console UI — Enable + endpoint controls in the Debug panel. Removes the legacy "Enable LLM Trace" toggle (replaced by OTel).
  • context.Context threading — added as the first parameter to llm.LanguageModel.ChatCompletion/ChatCompletionNoStream and propagated through all callers.
  • Spans at: HTTP routes (otelgin), bifrost.LLM.ChatCompletion, tool resolution and approval flows, DM/channel/thread processing, search, MCP tool calls, web search, streaming.StreamToPost.
  • Bifrost internal tracer adapter (bifrost/tracer.go) — implements Bifrost's schemas.Tracer against our OTel tracer. Provider calls, retries, fallbacks, and Bifrost plugin hooks now nest inside the surrounding plugin span in a single trace, with Bifrost's gen_ai.* attributes carried through.
  • Local dev stack (dev/docker-compose.otel.yml) — Grafana Tempo (OTLP on :4317) + Grafana on :3001 with the Tempo datasource preprovisioned. Anonymous Admin, no login.
  • Tests — telemetry integration tests (span export, attributes, error recording, parent-child); Bifrost adapter unit tests.
Known gaps (follow-up PR)
  • Most spans only have a name; top-level spans don't record user/bot/conversation/channel/post IDs or operation type.
  • llm chat completion is missing posts count, tools count, system-prompt size, and TTFT (the Bifrost adapter tracks TTFT internally but doesn't surface it on the span).
  • No spans for: prompt formatting, conversation history loading + token-budget truncation, file/attachment fetching, RAG embedding generation.
  • Errors are inconsistently recorded — only Bifrost spans currently call RecordError / SetStatus(codes.Error, ...).

Ticket Link

N/A

Screenshots

N/A — System Console adds two controls in the Debug panel (Enable OpenTelemetry boolean, OpenTelemetry Endpoint text field). All other changes are backend.

Release Note

Added OpenTelemetry tracing infrastructure for the Agents plugin. New "Enable OpenTelemetry" and "OpenTelemetry Endpoint" settings in the System Console Debug panel control distributed tracing of LLM calls, tool execution, search, MCP, and streaming operations. The legacy "Enable LLM Trace" setting was removed (fully replaced by OpenTelemetry).

Summary by CodeRabbit

  • New Features

    • OpenTelemetry tracing for HTTP endpoints, LLM completions, tool resolution, streaming and web searches; new telemetry APIs (Init, Tracer, DetachContext) and span attributes.
  • Configuration

    • Replaced enableLLMTrace with enableOpenTelemetry and openTelemetryEndpoint; UI updated (new OTLP endpoint field); debug toggle removed; token-usage logging controls added.
  • Refactor

    • Many call paths now accept context.Context to enable request-scoped cancellation and tracing.
  • Documentation

    • Added OpenTelemetry/tracing guide and span attribute reference.
  • Dev / Tests

    • Local Grafana Tempo Docker Compose and comprehensive telemetry tests added.

crspeller and others added 10 commits February 19, 2026 08:22
* Integrate Bifrost LLM gateway library for unified provider access

Replace individual LLM provider implementations (OpenAI, Anthropic, Azure,
Bedrock, Cohere, Mistral) with the Bifrost gateway library for chat completions.

Key changes:
- Add new bifrost/ package implementing llm.LanguageModel interface
- BifrostLLM adapter handles message conversion, tool calling, and streaming
- Maps ServiceConfig to Bifrost provider configuration
- Update bots/bots.go to use Bifrost for supported providers
- Keep asage provider as fallback for unsupported types
- Retain openai package for transcription and embeddings (not covered by Bifrost)
- Retain anthropic/bedrock packages for FetchModels and evals

This provides a unified API for multiple LLM providers while maintaining
backwards compatibility with existing configurations.

* Add Bifrost support for embeddings and transcription

Extend Bifrost integration to cover additional LLM capabilities:

Embeddings:
- Add BifrostEmbeddingProvider implementing embeddings.EmbeddingProvider interface
- Support CreateEmbedding and BatchCreateEmbeddings via Bifrost gateway
- Add "bifrost" provider type to embeddings configuration
- Support OpenAI, Azure, Cohere, and Bedrock providers for embeddings

Transcription:
- Add BifrostTranscriber for audio-to-text conversion via Bifrost
- Support VTT format output compatible with existing subtitles parsing
- Configurable model and provider selection

This provides unified access to embedding and transcription APIs through
the Bifrost gateway, allowing users to choose their preferred provider
while maintaining a consistent interface.

* Remove legacy provider packages in favor of Bifrost

This commit completes the migration to using Bifrost as the unified LLM
gateway by:

- Deleting openai/, anthropic/, bedrock/, and asage/ packages
- Removing ASage service type support from configuration
- Updating bots.go to use Bifrost exclusively for LLM and transcription
- Updating search/embeddings.go to use Bifrost for embedding providers
- Fixing type assignment issues in getLLM with interface types
- Using service DefaultModel for transcription (fallback to whisper-1)
- Cleaning up config.go by removing OpenAI-specific helper functions

All LLM providers are now accessed through the Bifrost gateway library,
providing a consistent interface across OpenAI, Anthropic, Azure,
OpenAI-compatible, and Bedrock providers.

* Improve Bifrost configuration to use built-in features

- Use NetworkConfig.BaseURL for OpenAI Compatible and other providers
  that support custom API endpoints (Azure uses AzureKeyConfig instead)
- Pass OpenAI OrgID via NetworkConfig.ExtraHeaders
- Configure built-in retry logic (2 retries with 1s-10s backoff)
- Remove unused httpClient from providerAccount struct
  (Bifrost manages its own HTTP client internally)

* Update Bifrost integration for newer library API

* Responses API

* Fix CI

* Mod tidy

* Add reasoning and annotation streaming to Bifrost responses

* Restore missing functionality in Bifrost migration

- Add auto-run tool resolution wrapper with recursive loop
- Honor ReasoningDisabled per-request config
- Route native web search through Responses API for all providers
- Stream reasoning/thinking events in chat API path
- Restore Anthropic thinking budget auto-calculation
- Preserve thinking blocks in conversation history
- Merge consecutive same-role messages for Anthropic
- Fix transcription model to always use whisper-1
- Pass embedding dimensions in requests
- Remove unused httpClient parameter from Bifrost

* Fix review issues in Bifrost migration

* /v1 issue

* Fix remaining e2e failures: reasoning summaries, mock model, and embedding URL

- Set reasoning.Summary = "auto" for OpenAI Responses API so reasoning
  summary events are emitted in the stream (matches old SDK behavior)
- Add defaultModel to mock service configs (Bifrost requires it)
- Pass APIURL for OpenAI embedding provider for consistency
- Normalize embedding provider URLs to strip /v1 suffix

* Fix some time issues

* Fix e2e flake mode

* Lint

* Annotations fixed

* Fix for Anthropic annotations

---------

Co-authored-by: Claude <noreply@anthropic.com>
Bring agents-v2 up to date with master, incorporating 8 commits:
token accounting, bot config hot-reload, license gating, e2e
observability, model constant updates, MCP OAuth fix, docs, and
RHS prompt hint removal. Resolved conflicts in bots/bots.go
(imports), evals/evals.go (model constants), and accepted deletion
of openai/ files superseded by the bifrost migration.

Made-with: Cursor
Phase 0 of OpenTelemetry integration: clean up observability code that
will be replaced by OTel spans or was never used.

- Remove EnableLLMTrace config field and LanguageModelLogWrapper
- Remove TraceResolved/TraceUnknown from ToolStore, simplify NewToolStore
- Remove dead LLMetrics interface, llmMetrics struct, GetMetricsForAIService,
  IncrementLLMRequests, and unused agents_llm_requests_total counter
- Update webapp config, e2e tests, and all callers

Co-authored-by: christopher <christopher@mattermost.com>
Phase 1: Create telemetry/ package with OTLP gRPC exporter setup,
semantic attribute constants, and TracerProvider lifecycle management.
Add EnableOpenTelemetry and OpenTelemetryEndpoint config fields.
Wire Init/Shutdown into plugin OnActivate/OnDeactivate.

Co-authored-by: christopher <christopher@mattermost.com>
Update all remaining implementations, mocks, and callers of the
LanguageModel interface to accept ctx context.Context as the first
parameter to ChatCompletion and ChatCompletionNoStream.

Implementations updated:
- llm/auto_run_tools_test.go (testLLM, capturingLLM)
- llm/token_tracking_test.go (MockLanguageModel)
- llm/token_tracking_bench_test.go (benchFakeLLM)
- llm/mocks/language_model_mock.go (generated mock)
- mmtools/web_search_test.go (mockLanguageModel)
- api/fake_llm_test.go (FakeLLM)

Production callers updated:
- conversations/conversations.go
- conversations/tool_handling.go
- channels/channels.go
- threads/threads.go
- search/search.go
- react/react.go
- meetings/meeting_summarization.go
- api/api_llm_bridge.go
- mmtools/web_search.go
- evals/graders.go

Test callers and mock expectations updated:
- llm/auto_run_tools_test.go
- llm/token_tracking_test.go
- llm/token_tracking_bench_test.go
- threads/threads_test.go
- react/react_test.go

Uses context.Background() for callers that don't have a context
available yet. Uses stdcontext import alias where the context
parameter name shadows the context package.

Co-authored-by: christopher <christopher@mattermost.com>
Phase 2b-2d: Replace context.Background() placeholders with real
request-scoped context from HTTP handlers and plugin hooks.
Fix streaming to use passed ctx as parent instead of context.Background().

Co-authored-by: christopher <christopher@mattermost.com>
Phase 3a: Add otelgin.Middleware to Gin router for automatic HTTP span
creation with method, route, and status code.

Phase 3b: Instrument bifrost ChatCompletion with spans that capture
provider, model, operation, token usage, and errors. Errors are recorded
on spans with proper status codes.

Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
Phase 4: Add comprehensive tests for telemetry Init/Shutdown, Tracer,
SpanFromContext, WithLLMAttributes, span hierarchy, and context
propagation. Fix schema URL mismatch between semconv versions by using
resource.New instead of resource.Merge.

Co-authored-by: christopher <christopher@mattermost.com>
Phase 5: Add dev/docker-compose.otel.yml with Jaeger all-in-one
container for local development. Run with:
  docker compose -f dev/docker-compose.otel.yml up -d
Then browse http://localhost:16686 for the Jaeger UI.

Configure plugin with EnableOpenTelemetry=true and
OpenTelemetryEndpoint=localhost:4317 to start exporting traces.

Co-authored-by: christopher <christopher@mattermost.com>
@cursor
Copy link
Copy Markdown
Contributor

cursor Bot commented Mar 5, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 5, 2026

🤖 LLM Evaluation Results

OpenAI

⚠️ Overall: 21/28 tests passed (75.0%)

Provider Total Passed Failed Pass Rate
⚠️ OPENAI 28 21 7 75.0%

❌ Failed Evaluations

Show 7 failures

OPENAI

1. TestReactEval/[openai]_react_cat_message

  • Score: 0.00
  • Rubric: The word/emoji is a cat emoji or a heart/love emoji
  • Reason: The output is the text token "heart_eyes_cat", not an actual cat emoji (e.g., 😺/🐱) or a heart/love emoji (e.g., ❤️/😍).

2. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: is a list of bugs
  • Reason: The output does not provide any actual bug list; it requests the user to paste bug reports and offers a template. Therefore it is not a list of bugs.

3. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: includes a description of each bug
  • Reason: The output asks the user to paste bug reports and provides a template, but it does not actually include any bug descriptions itself.

4. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: attributes each bug to a user
  • Reason: The output asks the user to provide bug reports and proposes including a 'Reported by' field, but it does not actually attribute any bug to a user because no bugs are listed or mapped to reporters.

5. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: attributes the bug about trying to save without a color and the save button not doing anything to @maria.nunez
  • Reason: The output does not mention the specific bug about saving without a color or the save button doing nothing, and it does not attribute any bug to @maria.nunez. It only asks the user to paste bug reports and provides a template.

6. TestConversationMentionHandling/[openai]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: the bug about the end user being able to change channel banner is attributed to @maria.nunez
  • Reason: The output does not mention the specific bug about an end user being able to change the channel banner, nor does it attribute that bug to @maria.nunez. It only asks the user to provide bug reports and offers a template.

7. TestDirectMessageConversations/[openai]_bot_dm_tool_introspection

  • Score: 0.00
  • Rubric: mentions Github and refers to the documentation
  • Reason: The output refers to Mattermost documentation but does not mention GitHub anywhere. Since the rubric requires both mentioning GitHub and referring to the documentation, it fails.

Anthropic

⚠️ Overall: 21/28 tests passed (75.0%)

Provider Total Passed Failed Pass Rate
⚠️ ANTHROPIC 28 21 7 75.0%

❌ Failed Evaluations

Show 7 failures

ANTHROPIC

1. TestReactEval/[anthropic]_react_cat_message

  • Score: 0.00
  • Rubric: The word/emoji is a cat emoji or a heart/love emoji
  • Reason: The output is the text "heart_eyes_cat", not an actual cat emoji (e.g., 😺/🐱) or a heart/love emoji (e.g., ❤️/😍).

2. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: is a list of bugs
  • Reason: The output does not provide any bugs or a list of bugs. It states inability to access trackers and suggests steps to find bugs, but contains no actual bug entries.

3. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: includes a description of each bug
  • Reason: The output states it cannot access bug trackers and suggests steps to find bugs, but it does not actually describe any bugs. Therefore it does not include a description of each bug.

4. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: attributes each bug to a user
  • Reason: The output does not list any bugs and therefore does not attribute each bug to a user. It only states it cannot access trackers and suggests steps to gather data.

5. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: attributes the bug about trying to save without a color and the save button not doing anything to @maria.nunez
  • Reason: The output states it cannot access trackers and suggests steps to find bugs, but it does not attribute the specific bug (saving without a color / save button does nothing) to @maria.nunez.

6. TestConversationMentionHandling/[anthropic]_conversation_from_attribution_long_thread.json

  • Score: 0.00
  • Rubric: the bug about the end user being able to change channel banner is attributed to @maria.nunez
  • Reason: The output does not mention the specific bug about an end user being able to change the channel banner, nor does it attribute that bug to @maria.nunez.

7. TestDirectMessageConversations/[anthropic]_bot_dm_tool_introspection

  • Score: 0.50
  • Rubric: mentions Github and refers to the documentation
  • Reason: The output refers to documentation (docs.mattermost.com) but does not mention GitHub anywhere, so it fails the rubric requirement to mention both GitHub and the documentation.

This comment was automatically generated by the eval CI pipeline.

cursoragent and others added 3 commits March 6, 2026 07:39
Add 6 integration tests verifying end-to-end span export:
- LLM completion spans with attributes and token usage
- Error spans with proper status codes and exception events
- Tool resolution spans (success and unknown tool error)
- Parent-child span hierarchy (HTTP -> LLM)
- Full request trace (HTTP -> process -> LLM -> tool)

Fix ResolveTool to record errors on spans for both unknown tools
and resolver failures.

Co-authored-by: christopher <christopher@mattermost.com>
- CLAUDE.md: Add tracing section covering telemetry package, context
  threading, span instrumentation patterns, config, local testing,
  and the stdcontext aliasing convention.
- docs/admin_guide.md: Add comprehensive OpenTelemetry section with
  what gets traced, how to enable, local Jaeger setup, production
  deployment guidance, and custom attribute reference table.

Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
@crspeller crspeller changed the base branch from opentelemetry to master March 25, 2026 12:26
cursoragent and others added 9 commits March 25, 2026 14:30
Resolve conflicts from 24 master commits including Bifrost migration,
structured output support, MCP stateless sessions, response placeholder
pattern, automated invoker detection, and e2e CI sharding.

Key merge decisions:
- Keep new StructuredOutputFallbackWrapper with ctx parameter
- Adopt master's respondToPost/createResponsePlaceholder pattern with ctx
- Keep isAutomatedInvoker/computeAllowToolsInChannel from master
- Update new LanguageModel implementations (structured output wrapper)
  to match interface with context.Context parameter
- Thread ctx through new streamResponseToExistingPost and respondToPost

Co-authored-by: christopher <christopher@mattermost.com>
Resolve conflicts from 5 master commits (per-tool config, dependency
updates, Anthropic fix, Entry SKU, search fix).

New code updated for OpenTelemetry:
- Add ctx to wrapStreamWithMCPAutoApproval and pass to ResolveTool
- Add ctx to AutoExecuteApprovedToolCalls -> HandleToolCall call
- Add ctx to capturingLanguageModel test mock
- Fix NewToolStore() calls in new tests (old 2-arg signature removed)
- Fix ToolAutoRunKey usage in auto_run_tools_test with ctx param

Co-authored-by: christopher <christopher@mattermost.com>
Resolve conflicts from 17 master commits including:
- Go module rename from mattermost-plugin-ai to mattermost-plugin-agents
- Config migration from config.json to database
- Auto-run everywhere tool policy
- Channel tools auto-run restriction for bot invocations
- LLM proxy support
- Development environment setup
- Removed built-in GitHub/Jira tools
- Tool failure recovery in agent loop
- License header updates
- Documentation additions

All import paths updated to mattermost-plugin-agents. OTel tracing
(ctx params, telemetry spans, telemetry imports) preserved throughout.

Co-authored-by: christopher <christopher@mattermost.com>
Replace the legacy Enable LLM Trace toggle in the Debug panel with:
- Enable OpenTelemetry (boolean toggle)
- OpenTelemetry Endpoint (text field, disabled when OTel is off)

EnableLLMTrace is fully redundant: our OTel spans capture LLM calls,
tool execution, errors, and token usage in a structured, queryable way
via any OTLP backend (Jaeger, Grafana Tempo, etc.), replacing the old
approach of dumping full conversation data to plugin logs.

Also clean up stale EnableLLMTrace references that were re-introduced
during master merges in test mocks, store tests, and e2e helpers.

Co-authored-by: christopher <christopher@mattermost.com>
Add context.Context to evalStreamLogger.ChatCompletion and
ChatCompletionNoStream, fix NewToolStore() call signature, and
add context.Background() to ChatCompletion call in tools_eval_test.

Co-authored-by: christopher <christopher@mattermost.com>
Co-authored-by: christopher <christopher@mattermost.com>
@crspeller crspeller marked this pull request as ready for review May 5, 2026 14:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d5b31d392

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread api/api_post.go Outdated
// Create analysis post with conversation ID
analysisPost := a.makeAnalysisPost(user.Locale, post.Id, data.AnalysisType, analyzeResult.ConversationID)
if err := a.streamingService.StreamToNewDM(stdcontext.Background(), botUserID, analyzeResult.Stream, user.Id, analysisPost, post.Id); err != nil {
if err := a.streamingService.StreamToNewDM(c.Request.Context(), botUserID, analyzeResult.Stream, user.Id, analysisPost, post.Id); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detach request cancellation before starting DM stream

Passing c.Request.Context() into StreamToNewDM causes the stream context to be canceled as soon as the HTTP handler returns, because StreamToNewDM starts streaming in a goroutine from a child context (GetStreamingContext uses context.WithCancel(inCtx)), and StreamToPost exits on <-ctx.Done(). In this flow the analysis response can be truncated or canceled immediately after the API call completes, so users may get partial/empty DM outputs even when generation succeeded.

Useful? React with 👍 / 👎.

Comment thread server/main.go Outdated
Comment on lines +516 to +517
if cfg.EnableOpenTelemetry {
shutdown, telErr := telemetry.Init(context.Background(), "mattermost-ai-agents", manifest.Version, cfg.OpenTelemetryEndpoint)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reinitialize telemetry when OpenTelemetry config changes

Telemetry initialization is only executed during OnActivate, so changing enableOpenTelemetry or openTelemetryEndpoint at runtime will not apply until the plugin is deactivated/reactivated. Because config updates in this plugin are normally propagated via registered update listeners, this makes the new System Console controls effectively stale after save and can leave tracing disabled (or pointed at an old endpoint) until a manual restart.

Useful? React with 👍 / 👎.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds OpenTelemetry tracing and request-context propagation across the plugin: new telemetry package and attributes, OTel init/shutdown, Gin middleware, instrumented spans for LLM/Bifrost, tool resolution, MCP, web searches and streaming; many APIs and wrappers accept context.Context; config/UI replaced LLM-trace toggle with OpenTelemetry settings; local Grafana/Tempo dev stack and telemetry tests added.

Changes

Telemetry + Context Propagation (single cohesive DAG)

Layer / File(s) Summary
Data Shape / Public API
llm/language_model.go, llm/tools.go, threads/threads.go, conversations/conversations.go, toolrunner/toolrunner.go
LanguageModel methods and many service/handler APIs gain a leading context.Context parameter; ToolStore.ResolveTool and ToolRunner.Run accept ctx; removed GetEnableLLMTrace() from config/provider interfaces; added DMConversationResult.UserTurnID.
Core Telemetry Foundation
telemetry/telemetry.go, telemetry/attributes.go, telemetry/turn_id.go
New telemetry package: Init/Shutdown, Tracer(), SpanFromContext(), DetachContext(), WithLLMAttributes; exported attribute keys; deterministic turn-ID based SpanContext/ID generator and helper WithTurnID.
OTel Instrumentation Implementation
bifrost/tracer.go, bifrost/bifrost.go, bifrost/logger.go, mcp/client.go
Added OTel-backed Bifrost tracer adapter with deferred-span/stream-accumulator logic; threaded ctx into ChatCompletion/streaming; record token counts/errors; wired tracer into Bifrost client; MCP calls now start spans and record errors.
Tooling & Runner Integration
toolrunner/toolrunner.go, llm/tools.go
ToolRunner.Run now takes ctx and propagates it; ToolStore.ResolveTool accepts ctx,llmCtx, creates "resolve tool" spans, records errors and auth errors store; removed old trace-log hooks; constructors simplified.
Language Model Wrappers & Mocks
llm/* (truncation, token_tracking, structured_output_fallback, logging, mocks, token tests)
Updated wrappers and generated mocks to accept/forward context.Context; removed LanguageModelLogWrapper; token-usage and truncation wrappers forward ctx; tests/mocks updated.
HTTP & Handler Wiring
api/api.go, api/*, conversations/*, conversations/handle_messages.go, search/search.go, streaming/streaming.go, react/react.go, server/main.go
Registered Gin OTel middleware; propagated request contexts into analyzers, LLM calls, tool runs and streaming; added spans around request handlers and search async work; Plugin OnActivate/OnDeactivate initialize/shutdown telemetry.
Search / Websearch / Providers
search/*, websearch/google.go, websearch/brave.go
Search async processing now detaches/propagates ctx and starts spans; web search providers instrumented with spans and error recording.
Configuration, UI & Dev Stack
config/config.go, webapp/src/components/system_console/*, dev/docker-compose.otel.yml, dev/tempo/*, dev/grafana/*, docs/admin_guide.md, CLAUDE.md
Replaced EnableLLMTrace with EnableOpenTelemetry + OpenTelemetryEndpoint; UI updated; added local Grafana+Tempo compose/provisioning and documentation describing tracing.
Tests / Integration
telemetry/*, many *_test.go
Added telemetry unit/integration tests (attributes, detach, turn ID generator, integration traces); updated many tests and test doubles to new context-aware signatures and added async span coverage tests.
Dependency updates
go.mod
Added OpenTelemetry dependencies (otelgin, otlptrace, SDK, API) and bumped several indirect modules.

Sequence Diagram(s)

sequenceDiagram
    participant Client as HTTP Client
    participant Gin as Gin API
    participant Telemetry as OTel Tracer
    participant Service as Plugin Service
    participant ToolRunner as ToolRunner
    participant LLM as Bifrost/LLM
    participant Exporter as OTLP Exporter

    Client->>Gin: HTTP request
    Gin->>Telemetry: Start request span
    Gin->>Service: Handler(ctx)
    Service->>Telemetry: Start operation span
    Service->>ToolRunner: Run(ctx, ...)
    ToolRunner->>Telemetry: Start "resolve tool" span
    ToolRunner->>LLM: ChatCompletion(ctx, req)
    LLM->>Telemetry: Start "llm chat completion" span
    LLM->>Telemetry: Add attributes/events (tokens, errors)
    Telemetry->>Exporter: Export spans
    LLM-->>Service: Stream/result
    Service-->>Gin: Response
    Gin-->>Client: HTTP response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cursor/opentelemetry-tracing-138e

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
websearch/google.go (1)

93-100: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing error recording on span for additional failure paths.

The span only records errors for the HTTP request failure (lines 87-88), but two other error paths return errors without updating the span status:

  • Non-OK HTTP status (line 94)
  • JSON decode failure (line 99)

This creates observability gaps where the span appears successful despite the operation failing.

🔧 Proposed fix to record all errors on the span
 	if resp.StatusCode != http.StatusOK {
-		return nil, fmt.Errorf("web search request failed: status %s", resp.Status)
+		err := fmt.Errorf("web search request failed: status %s", resp.Status)
+		span.RecordError(err)
+		span.SetStatus(codes.Error, err.Error())
+		return nil, err
 	}

 	var payload googleSearchResponse
 	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
-		return nil, fmt.Errorf("failed to decode web search response: %w", err)
+		span.RecordError(err)
+		span.SetStatus(codes.Error, err.Error())
+		return nil, fmt.Errorf("failed to decode web search response: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@websearch/google.go` around lines 93 - 100, The span isn't updated on the
non-OK HTTP status and JSON decode failure paths; update the span before
returning error in both cases by recording the error (span.RecordError(...)) and
setting the span status to error (span.SetStatus(codes.Error, "web search
failed" or the specific message)) for the non-OK response check around resp and
for the JSON decode error when decoding into payload; use the same span variable
used earlier and the otel codes package (codes.Error) so traces reflect these
failures.
search/search.go (1)

268-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Span ends before async work completes.

The span created at line 268 ends when RunSearch returns (via defer span.End() at line 269), but processSearch continues executing asynchronously on line 291. This means:

  1. The "run search" span will be recorded as complete before the actual search work finishes.
  2. Any child spans created within processSearch (e.g., from ChatCompletion) will either be orphaned or have incorrect timing relationships.

Consider either:

  • Starting the span inside processSearch instead, or
  • Using trace.ContextWithSpan to create a detached context for the goroutine that doesn't rely on the parent span's lifecycle.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search/search.go` around lines 268 - 291, The "run search" span created with
telemetry.Tracer().Start in RunSearch is ended before the asynchronous work in
s.processSearch completes (defer span.End()), so move span lifecycle into the
goroutine: remove the span creation/defer in RunSearch and instead start and end
a new span at the top of s.processSearch (use telemetry.Tracer().Start(ctx, "run
search") inside processSearch), passing the original ctx into s.processSearch as
before; alternatively, if you prefer to keep the parent span in RunSearch,
create a detached context for the goroutine (e.g., use trace.ContextWithSpan or
a new context.Background() and start a fresh span in the goroutine) so the child
spans in s.processSearch are not orphaned when RunSearch returns.
🧹 Nitpick comments (8)
conversation/service_test.go (1)

7-8: 💤 Low value

Redundant import alias.

Both imports refer to the same context package. The stdcontext alias is only needed when context *llm.Context shadows the package name, which doesn't occur in this test file. Consider using just one import.

Suggested fix
 import (
 	"context"
-	stdcontext "context"
 	"encoding/json"

Then update Lines 99 and 103 to use context.Context instead of stdcontext.Context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conversation/service_test.go` around lines 7 - 8, Remove the redundant alias
import of the standard context package (stdcontext) and keep only the plain
"context" import; then update any usages that reference stdcontext.Context
(notably in the test functions where stdcontext.Context is used) to use
context.Context instead so the file consistently uses the standard context
package and the duplicate alias is eliminated.
conversation/service.go (1)

518-539: ⚡ Quick win

GenerateTitle should accept context.Context for proper tracing propagation.

Per the coding guidelines, all functions in the request pipeline should accept ctx context.Context as the first parameter. While GenerateTitle is called asynchronously, passing a context would enable cancellation and trace propagation from the caller. Currently, stdcontext.Background() creates an orphaned span that won't link to the parent request trace.

Proposed signature change
 func (s *Service) GenerateTitle(
+	ctx stdcontext.Context,
 	conversationID string,
 	lm llm.LanguageModel,
 	userMessage string,
 	context *llm.Context,
 ) error {
 	request := "Write a short title for the following request. Include only the title and nothing else, no quotations. Request:\n" + userMessage

 	req := llm.CompletionRequest{
 		Posts: []llm.Post{
 			{Role: llm.PostRoleUser, Message: request},
 		},
 		Context:          context,
 		Operation:        llm.OperationTitleGeneration,
 		OperationSubType: llm.SubTypeNoStream,
 	}

-	title, err := lm.ChatCompletionNoStream(stdcontext.Background(), req,
+	title, err := lm.ChatCompletionNoStream(ctx, req,

Callers can then pass a detached context with timeout if needed:

ctx, cancel := stdcontext.WithTimeout(stdcontext.Background(), 30*time.Second)
defer cancel()
go s.GenerateTitle(ctx, convID, lm, msg, llmCtx)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conversation/service.go` around lines 518 - 539, GenerateTitle currently
creates an orphan context with stdcontext.Background(), breaking
trace/cancellation propagation; change the method signature to func (s *Service)
GenerateTitle(ctx context.Context, conversationID string, lm llm.LanguageModel,
userMessage string, context *llm.Context) error, pass that ctx into the
lm.ChatCompletionNoStream call instead of stdcontext.Background(), and update
all callers to supply an appropriate context (e.g., a parent request ctx or a
timed context) when invoking Service.GenerateTitle so tracing and cancellation
flow correctly.
mcp/client.go (1)

435-448: ⚖️ Poor tradeoff

Incomplete span error recording.

The span records errors for the nil-session case but not for subsequent failures (reconnection errors at Lines 467, 472, 483; tool call errors at Lines 490, 493; and result errors at Lines 512-514, 521). This creates inconsistent trace visibility where some MCP failures appear as errors and others don't.

Consider a helper pattern

To avoid cluttering every error path, wrap the error return:

recordErr := func(err error) error {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
    return err
}

Then use return "", recordErr(err) at each error return site.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp/client.go` around lines 435 - 448, The span started by
telemetry.Tracer().Start (named via "mcp call tool") records errors only for the
initial nil-session check; add a small helper closure (e.g., recordErr :=
func(err error) error { span.RecordError(err); span.SetStatus(codes.Error,
err.Error()); return err }) immediately after creating ctx, span and use it at
every subsequent error return (reconnection errors around the reconnect calls,
tool call errors from the call to session.Call toolName, and result/parse errors
when handling the returned result) by replacing returns like `return "", err`
with `return "", recordErr(err)` so all failure paths consistently record errors
on the span.
telemetry/telemetry.go (1)

34-40: 💤 Low value

Consider making TLS configuration controllable.

The OTLP exporter is hardcoded to use insecure (unencrypted) gRPC connections. This is fine for local development or internal collectors, but could be a security concern for production deployments where the collector is accessed over untrusted networks.

Consider adding a configuration option to enable TLS, or at minimum document this limitation in the admin guide.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@telemetry/telemetry.go` around lines 34 - 40, The OTLP exporter is forced to
use insecure gRPC via otlptracegrpc.WithInsecure(); add a configurable TLS
option so production can use encrypted connections: extend the telemetry
configuration (e.g., add a TLS boolean and optional cert/CA path or serverName)
and, in the exporter creation logic around otlptracegrpc.New, choose between
otlptracegrpc.WithInsecure() and supplying proper transport credentials (e.g.,
grpc/credentials created from the configured certs or system roots) based on
that config; update the exporter variable creation (exporter, err :=
otlptracegrpc.New(...)) to build the appropriate dial options and make sure
imports and error handling reflect the new branch.
llm/tools.go (1)

322-335: ⚡ Quick win

Add Go doc for the changed exported ToolStore APIs.

NewToolStore and ResolveTool are exported symbols changed in this PR and still have no doc comments.

As per coding guidelines "Document all public APIs".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/tools.go` around lines 322 - 335, Add missing Go doc comments for the
exported APIs: NewToolStore and ResolveTool (also add docs for AddTools if it is
exported) by placing a descriptive comment immediately above each function: for
NewToolStore describe it returns a pointer to a ToolStore initialized with empty
tools and authErrors; for ResolveTool document the purpose (resolving a tool by
name), explain the parameters (ctx context.Context, name string, argsGetter
ToolArgumentGetter, llmCtx *Context) and the return values (resolved string and
error), and note any side effects or error conditions; ensure comments follow
GoDoc style (start with the function name) so public API is documented.
llm/structured_output_fallback.go (1)

24-29: ⚡ Quick win

Add Go doc for the exported wrapper methods.

ChatCompletion and ChatCompletionNoStream are part of the public wrapper surface and were touched here, but they still have no doc comments.

As per coding guidelines "Document all public APIs".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/structured_output_fallback.go` around lines 24 - 29, Add Go doc comments
above the exported methods of StructuredOutputFallbackWrapper: provide a short
sentence for ChatCompletion describing that it performs a streaming chat
completion by delegating to the wrapped LanguageModel and returns a
TextStreamResult, and a short sentence for ChatCompletionNoStream describing
that it performs a non-streaming chat completion by delegating to the wrapped
LanguageModel and returns the full string response and error; place the comments
immediately above the ChatCompletion and ChatCompletionNoStream method
declarations using the exact function names.
conversations/handle_messages.go (1)

85-85: ⚡ Quick win

Add GoDoc for exported MessageHasBeenPosted.

This exported method is missing a doc comment, which violates the public API documentation rule.

As per coding guidelines, "Document all public APIs".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conversations/handle_messages.go` at line 85, Add a GoDoc comment above the
exported method MessageHasBeenPosted describing its purpose and behavior; start
the comment with "MessageHasBeenPosted ..." and mention the receiver
(*Conversations), the parameters (_ *plugin.Context, post *model.Post) and any
important side effects (e.g., how posts are handled or events emitted) and error
conditions if applicable so the public API is documented.
toolrunner/toolrunner.go (1)

315-320: ⚡ Quick win

Set OTel span error status explicitly on resolver failures.

You already add a telemetry.ToolStatus attribute, but failed tool spans should also record the error and set span status so backends/alerts classify them as failures.

Suggested patch
 import (
 	"context"
 	"encoding/json"
 	"fmt"
 	"strings"

 	"github.com/mattermost/mattermost-plugin-agents/llm"
 	"github.com/mattermost/mattermost-plugin-agents/telemetry"
+	"go.opentelemetry.io/otel/codes"
 	"go.opentelemetry.io/otel/trace"
 )
@@
 			if resolveErr != nil {
+				span.RecordError(resolveErr)
+				span.SetStatus(codes.Error, resolveErr.Error())
 				span.SetAttributes(telemetry.ToolStatus.String("error"))
 			} else {
+				span.SetStatus(codes.Ok, "")
 				span.SetAttributes(telemetry.ToolStatus.String("success"))
 			}
 			span.End()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@toolrunner/toolrunner.go` around lines 315 - 320, The span for tool
resolution currently only sets a telemetry.ToolStatus attribute but does not
mark the OpenTelemetry span as an error; when resolveErr != nil, call
span.RecordError(resolveErr) and set the span status to error (e.g.,
span.SetStatus(codes.Error, resolveErr.Error())) before ending the span so
backends classify failures correctly; update the error branch in the ToolRunner
resolver handling (where resolveErr, span, telemetry.ToolStatus are used) to
record the error and set the span status, leaving the success branch to set
success attribute and appropriate OK status if desired.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/admin_guide.md`:
- Line 356: The fenced code block showing the trace example is missing a
language identifier which triggers markdownlint MD040; update the opening fence
(the triple-backticks that precede "HTTP POST /post/:postid/react") to include a
language tag such as text (e.g., ```text) so the block becomes a highlighted
plain-text fenced code block and resolves the MD040 lint warning.

In `@e2e/tests/system-console/debug-panel.spec.ts`:
- Around line 29-33: RADIO_INDICES for enableTokenUsageLogging points at the
wrong radio controls due to an added BooleanItem (Enable OpenTelemetry) shifting
indices; update the RADIO_INDICES constant's enableTokenUsageLogging mapping
(the enableTokenUsageLogging key inside RADIO_INDICES) to use the corrected pair
of indices (increment both current indices by 2 so the true/false values point
to the actual Token Usage Logging radios), or replace the hard-coded numbers
with a small helper that computes offsets based on the preceding BooleanItem
positions.

In `@go.mod`:
- Around line 29-33: The otlptracegrpc dependency in go.mod is pinned to v1.20.0
while the other OpenTelemetry modules (go.opentelemetry.io/otel,
go.opentelemetry.io/otel/sdk, go.opentelemetry.io/otel/trace) are at v1.43.0;
update the version for
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc to v1.43.0 in
go.mod so all OpenTelemetry modules use the same release series and then run `go
mod tidy` to refresh the dependency graph.

In `@llm/language_model.go`:
- Around line 29-32: Add concise method-level comments for the exported
LanguageModel interface and its two public methods: ChatCompletion and
ChatCompletionNoStream. For ChatCompletion, document that it accepts a
context.Context for cancellation/deadlines, returns a TextStreamResult that
provides streaming tokens/updates and may yield partial results, and describe
error semantics when the stream fails or is canceled. For
ChatCompletionNoStream, document that it performs a non-streaming completion,
returns the final string result, respects context cancellation/deadlines, and
describe error behavior for failures. Ensure the comments are brief, placed
immediately above each method signature, and follow the project's public API
comment style.

In `@llm/tools.go`:
- Around line 335-353: ResolveTool currently starts a request span but calls
tool.Resolver without passing the request context, so cancellation/tracing don't
propagate; update the ToolResolver type signature to accept ctx context.Context
as the first parameter (e.g. func(ctx context.Context, llmCtx *Context, args
ToolArgumentGetter) (string, error)) and change all Resolver implementations and
the call site in ToolStore.ResolveTool to pass the incoming ctx through to
tool.Resolver; if the file imports the context package as the identifier
"context" but also defines llm.Context, import the stdlib context package with
the alias stdcontext and use that alias in the new signatures and calls.

In `@search/search_conversation_test.go`:
- Around line 415-416: The mock setup for mockLLM.On("ChatCompletionNoStream",
...) uses only two arguments but the ChatCompletionNoStream signature now takes
a context.Context plus request and options (3+ args); update the expectation in
the test to match by calling mockLLM.On("ChatCompletionNoStream", mock.Anything,
mock.Anything, mock.Anything) (or otherwise include mock.Anything for the ctx
parameter and for the variadic opts) so the mock matches the
ChatCompletionNoStream method signature used elsewhere.

---

Outside diff comments:
In `@search/search.go`:
- Around line 268-291: The "run search" span created with
telemetry.Tracer().Start in RunSearch is ended before the asynchronous work in
s.processSearch completes (defer span.End()), so move span lifecycle into the
goroutine: remove the span creation/defer in RunSearch and instead start and end
a new span at the top of s.processSearch (use telemetry.Tracer().Start(ctx, "run
search") inside processSearch), passing the original ctx into s.processSearch as
before; alternatively, if you prefer to keep the parent span in RunSearch,
create a detached context for the goroutine (e.g., use trace.ContextWithSpan or
a new context.Background() and start a fresh span in the goroutine) so the child
spans in s.processSearch are not orphaned when RunSearch returns.

In `@websearch/google.go`:
- Around line 93-100: The span isn't updated on the non-OK HTTP status and JSON
decode failure paths; update the span before returning error in both cases by
recording the error (span.RecordError(...)) and setting the span status to error
(span.SetStatus(codes.Error, "web search failed" or the specific message)) for
the non-OK response check around resp and for the JSON decode error when
decoding into payload; use the same span variable used earlier and the otel
codes package (codes.Error) so traces reflect these failures.

---

Nitpick comments:
In `@conversation/service_test.go`:
- Around line 7-8: Remove the redundant alias import of the standard context
package (stdcontext) and keep only the plain "context" import; then update any
usages that reference stdcontext.Context (notably in the test functions where
stdcontext.Context is used) to use context.Context instead so the file
consistently uses the standard context package and the duplicate alias is
eliminated.

In `@conversation/service.go`:
- Around line 518-539: GenerateTitle currently creates an orphan context with
stdcontext.Background(), breaking trace/cancellation propagation; change the
method signature to func (s *Service) GenerateTitle(ctx context.Context,
conversationID string, lm llm.LanguageModel, userMessage string, context
*llm.Context) error, pass that ctx into the lm.ChatCompletionNoStream call
instead of stdcontext.Background(), and update all callers to supply an
appropriate context (e.g., a parent request ctx or a timed context) when
invoking Service.GenerateTitle so tracing and cancellation flow correctly.

In `@conversations/handle_messages.go`:
- Line 85: Add a GoDoc comment above the exported method MessageHasBeenPosted
describing its purpose and behavior; start the comment with
"MessageHasBeenPosted ..." and mention the receiver (*Conversations), the
parameters (_ *plugin.Context, post *model.Post) and any important side effects
(e.g., how posts are handled or events emitted) and error conditions if
applicable so the public API is documented.

In `@llm/structured_output_fallback.go`:
- Around line 24-29: Add Go doc comments above the exported methods of
StructuredOutputFallbackWrapper: provide a short sentence for ChatCompletion
describing that it performs a streaming chat completion by delegating to the
wrapped LanguageModel and returns a TextStreamResult, and a short sentence for
ChatCompletionNoStream describing that it performs a non-streaming chat
completion by delegating to the wrapped LanguageModel and returns the full
string response and error; place the comments immediately above the
ChatCompletion and ChatCompletionNoStream method declarations using the exact
function names.

In `@llm/tools.go`:
- Around line 322-335: Add missing Go doc comments for the exported APIs:
NewToolStore and ResolveTool (also add docs for AddTools if it is exported) by
placing a descriptive comment immediately above each function: for NewToolStore
describe it returns a pointer to a ToolStore initialized with empty tools and
authErrors; for ResolveTool document the purpose (resolving a tool by name),
explain the parameters (ctx context.Context, name string, argsGetter
ToolArgumentGetter, llmCtx *Context) and the return values (resolved string and
error), and note any side effects or error conditions; ensure comments follow
GoDoc style (start with the function name) so public API is documented.

In `@mcp/client.go`:
- Around line 435-448: The span started by telemetry.Tracer().Start (named via
"mcp call tool") records errors only for the initial nil-session check; add a
small helper closure (e.g., recordErr := func(err error) error {
span.RecordError(err); span.SetStatus(codes.Error, err.Error()); return err })
immediately after creating ctx, span and use it at every subsequent error return
(reconnection errors around the reconnect calls, tool call errors from the call
to session.Call toolName, and result/parse errors when handling the returned
result) by replacing returns like `return "", err` with `return "",
recordErr(err)` so all failure paths consistently record errors on the span.

In `@telemetry/telemetry.go`:
- Around line 34-40: The OTLP exporter is forced to use insecure gRPC via
otlptracegrpc.WithInsecure(); add a configurable TLS option so production can
use encrypted connections: extend the telemetry configuration (e.g., add a TLS
boolean and optional cert/CA path or serverName) and, in the exporter creation
logic around otlptracegrpc.New, choose between otlptracegrpc.WithInsecure() and
supplying proper transport credentials (e.g., grpc/credentials created from the
configured certs or system roots) based on that config; update the exporter
variable creation (exporter, err := otlptracegrpc.New(...)) to build the
appropriate dial options and make sure imports and error handling reflect the
new branch.

In `@toolrunner/toolrunner.go`:
- Around line 315-320: The span for tool resolution currently only sets a
telemetry.ToolStatus attribute but does not mark the OpenTelemetry span as an
error; when resolveErr != nil, call span.RecordError(resolveErr) and set the
span status to error (e.g., span.SetStatus(codes.Error, resolveErr.Error()))
before ending the span so backends classify failures correctly; update the error
branch in the ToolRunner resolver handling (where resolveErr, span,
telemetry.ToolStatus are used) to record the error and set the span status,
leaving the success branch to set success attribute and appropriate OK status if
desired.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: be5ed64a-0947-4d96-a059-ec54dc9f6aec

📥 Commits

Reviewing files that changed from the base of the PR and between 93b1f2c and 7d5b31d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (87)
  • CLAUDE.md
  • api/api.go
  • api/api_channel.go
  • api/api_llm_bridge.go
  • api/api_no_tools_test.go
  • api/api_post.go
  • api/fake_llm_test.go
  • bifrost/bifrost.go
  • bifrost/bifrost_test.go
  • bifrost/logger.go
  • bifrost/models.go
  • bifrost/tracer.go
  • bifrost/tracer_test.go
  • bots/bots.go
  • bots/bots_test.go
  • channels/analysis_conversation_test.go
  • channels/channels.go
  • channels/channels_eval_test.go
  • config/config.go
  • conversation/service.go
  • conversation/service_test.go
  • conversations/bot_channel_tool_filter_test.go
  • conversations/conversations.go
  • conversations/conversations_test.go
  • conversations/direct_message_eval_test.go
  • conversations/dm_conversation_test.go
  • conversations/handle_messages.go
  • conversations/handle_messages_test.go
  • conversations/regeneration.go
  • conversations/tool_approval.go
  • conversations/tool_policy_test.go
  • dev/docker-compose.otel.yml
  • dev/grafana/provisioning/datasources/tempo.yaml
  • dev/tempo/tempo.yaml
  • docs/admin_guide.md
  • e2e/helpers/bot-config.ts
  • e2e/helpers/plugincontainer.ts
  • e2e/helpers/real-api-container.ts
  • e2e/helpers/system-console-container.ts
  • e2e/tests/multiplayer-tool-calling/multiplayer-tool-calling.spec.ts
  • e2e/tests/system-console/debug-panel.spec.ts
  • e2e/tests/system-console/live-service-full-flow.spec.ts
  • evals/graders.go
  • go.mod
  • llm/language_model.go
  • llm/logging.go
  • llm/mocks/language_model_mock.go
  • llm/structured_output_fallback.go
  • llm/structured_output_fallback_test.go
  • llm/token_tracking.go
  • llm/token_tracking_bench_test.go
  • llm/token_tracking_test.go
  • llm/tools.go
  • llm/tools_test.go
  • llm/truncation.go
  • llmcontext/llm_context.go
  • llmcontext/llm_context_test.go
  • mcp/client.go
  • mcpserver/eval_helpers_test.go
  • mcpserver/tools_eval_test.go
  • meetings/meeting_summarization.go
  • metrics/metrics.go
  • metrics/noop.go
  • mmtools/web_search.go
  • mmtools/web_search_test.go
  • prompts/standard_personality_without_locale_test.go
  • react/react.go
  • react/react_test.go
  • search/search.go
  • search/search_conversation_test.go
  • search/search_test.go
  • server/main.go
  • store/config_test.go
  • streaming/streaming.go
  • telemetry/attributes.go
  • telemetry/integration_test.go
  • telemetry/telemetry.go
  • telemetry/telemetry_test.go
  • threads/analysis_conversation_test.go
  • threads/threads.go
  • threads/threads_test.go
  • toolrunner/toolrunner.go
  • toolrunner/toolrunner_test.go
  • webapp/src/components/system_console/config.tsx
  • webapp/src/components/system_console/plugin_config_types.tsx
  • websearch/brave.go
  • websearch/google.go
💤 Files with no reviewable changes (14)
  • bifrost/models.go
  • e2e/helpers/system-console-container.ts
  • e2e/helpers/bot-config.ts
  • llmcontext/llm_context_test.go
  • e2e/tests/multiplayer-tool-calling/multiplayer-tool-calling.spec.ts
  • store/config_test.go
  • bots/bots.go
  • api/api_no_tools_test.go
  • e2e/tests/system-console/live-service-full-flow.spec.ts
  • e2e/helpers/plugincontainer.ts
  • metrics/noop.go
  • e2e/helpers/real-api-container.ts
  • bots/bots_test.go
  • metrics/metrics.go

Comment thread docs/admin_guide.md Outdated
Comment thread e2e/tests/system-console/debug-panel.spec.ts Outdated
Comment thread go.mod
Comment thread llm/language_model.go
Comment thread llm/tools.go
Comment thread search/search_conversation_test.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
telemetry/detach_test.go (1)

44-56: ⚡ Quick win

Prefer a local tracer to avoid mutating the global OTel provider.

The save/restore pattern here is correct, but using a local tracer from tp is cleaner and keeps the test fully isolated. Since the test only needs to verify that span relationships are preserved (not that a specific named tracer is used), you can replace telemetry.Tracer() with a tracer obtained directly from the test provider.

Suggested change
-	"go.opentelemetry.io/otel"
 	sdktrace "go.opentelemetry.io/otel/sdk/trace"
 	"go.opentelemetry.io/otel/sdk/trace/tracetest"
@@
 	exporter := tracetest.NewInMemoryExporter()
 	tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
-	prev := otel.GetTracerProvider()
-	otel.SetTracerProvider(tp)
+	tracer := tp.Tracer("test")
 	t.Cleanup(func() {
 		_ = tp.Shutdown(context.Background())
-		otel.SetTracerProvider(prev)
 	})
 
-	parent, parentSpan := telemetry.Tracer().Start(context.Background(), "parent")
+	parent, parentSpan := tracer.Start(context.Background(), "parent")
@@
-	_, childSpan := telemetry.Tracer().Start(detached, "child")
+	_, childSpan := tracer.Start(detached, "child")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@telemetry/detach_test.go` around lines 44 - 56, The test currently mutates
the global OTel provider and then calls telemetry.Tracer(); instead, use the
local test provider's tracer to avoid relying on global state: replace calls to
telemetry.Tracer().Start(...) with tp.Tracer("test").Start(...) for creating the
parent and child spans (keeping telemetry.DetachContext(parent) and the existing
cleanup/shutdown logic), so the test uses the local tracer from tp and still
asserts parent/child TraceID equality.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@search/search_test.go`:
- Around line 744-750: The test currently blocks indefinitely waiting on the
mockEmbedding Search hook by doing a plain receive from processSearchEntered
(created alongside processSearchProceed) inside the mock Run callback; change
the wait to use a bounded wait (for example, replace the direct
<-processSearchEntered with a select that listens for processSearchEntered and a
time.After timeout, or use require.Eventually/require.Eventuallyf to assert the
mock was entered) so the test fails fast instead of hanging; update the
corresponding waits around the mockEmbedding.On("Search", ...) Run closure and
any places that read processSearchEntered/processSearchProceed to use this
timeout-based pattern.

In `@search/search.go`:
- Around line 512-513: The fallback direct LLM call
(bot.LLM().ChatCompletionNoStream(ctx, prompt)) can still invoke tools; change
the call to pass the tools-disabled option so tools are disabled in the
non-conversation path (e.g., call ChatCompletionNoStream with
llm.WithToolsDisabled()). Update the invocation in the fallback path where
ChatCompletionNoStream is used and ensure the same option is applied as in the
conversation-backed paths so behavior no longer depends on conversationService
wiring.
- Around line 299-302: The OpenTelemetry spans started in processSearch and
SearchQuery need to record and mark errors on any failing execution paths:
locate the span variable in processSearch and in SearchQuery and on every error
return (including deferred errors or early returns) call span.RecordError(err)
and span.SetStatus(codes.Error, err.Error()) before returning; ensure the codes
symbol from the OpenTelemetry package is imported if not already, and add these
calls consistently for all error branches so failed searches are visible in
traces.

---

Nitpick comments:
In `@telemetry/detach_test.go`:
- Around line 44-56: The test currently mutates the global OTel provider and
then calls telemetry.Tracer(); instead, use the local test provider's tracer to
avoid relying on global state: replace calls to telemetry.Tracer().Start(...)
with tp.Tracer("test").Start(...) for creating the parent and child spans
(keeping telemetry.DetachContext(parent) and the existing cleanup/shutdown
logic), so the test uses the local tracer from tp and still asserts parent/child
TraceID equality.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0abb2ac8-2de1-467d-aecb-8eb17a452e7d

📥 Commits

Reviewing files that changed from the base of the PR and between c2beb63 and efc1944.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • api/api_channel.go
  • api/api_post.go
  • docs/admin_guide.md
  • e2e/tests/system-console/debug-panel.spec.ts
  • go.mod
  • search/search.go
  • search/search_conversation_test.go
  • search/search_test.go
  • telemetry/detach_test.go
  • telemetry/telemetry.go
✅ Files skipped from review due to trivial changes (2)
  • docs/admin_guide.md
  • e2e/tests/system-console/debug-panel.spec.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • telemetry/telemetry.go
  • search/search_conversation_test.go
  • api/api_channel.go

Comment thread search/search_test.go
Comment thread search/search.go
Comment thread search/search.go
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Security Review: OpenTelemetry Tracing

No new medium, high, or critical security vulnerabilities identified in this PR.

What was reviewed

This PR adds OpenTelemetry distributed tracing infrastructure to the Agents plugin, threading context.Context through the LLM call chain, adding span instrumentation at major boundaries (HTTP, LLM calls, tool execution, MCP, search, web search, streaming), and replacing the legacy EnableLLMTrace feature with structured OTel exports.

Security-relevant aspects examined

  • Span attributes contain only operational metadata: Exported attributes are limited to entity IDs (ai.user.id, ai.channel.id, ai.post.id), provider/model identifiers, operation types, token counts, tool names, and MCP server names. No post content, LLM prompts, system instructions, tool arguments, tool results, API keys, or user-generated content is included in span data.

  • Legacy EnableLLMTrace removal is a security improvement: The removed LanguageModelLogWrapper and TraceResolved/TraceUnknown methods logged full conversation data including prompts, tool arguments, and tool results to plugin logs. The OTel replacement exports only structured metadata — this eliminates a data exposure vector.

  • Error messages are sanitized: All error paths in the plugin's own spans (bifrost.go streamChat/streamResponses) pass through SanitizeProviderError() which strips the API key before recording on spans. The Bifrost tracer adapter receives error messages from Bifrost's internal callbacks (PopulateLLMResponseAttributes, EndSpan) without explicit sanitization, but these messages originate from LLM provider API responses which do not echo API keys, and the trace data is sent to admin-controlled infrastructure.

  • DetachContext correctly handles async streaming: The new telemetry.DetachContext() helper preserves the OTel span while breaking the cancellation chain. This replaces the previous context.Background() usage and is properly applied at all three StreamToNewDM call sites and the async processSearch goroutine.

  • Auth/permission model unchanged: All HTTP handlers continue to authenticate via the server-injected Mattermost-User-Id header. No new endpoints, data access paths, or permission checks were introduced. The context.Context plumbing is purely for trace propagation, not for carrying auth or permission state.

  • OpenTelemetryEndpoint is admin-only configuration: The endpoint field is exposed only in the System Console Debug panel to system admins. The gRPC exporter makes outbound connections using the OTLP protocol (not arbitrary HTTP), sending only trace protobuf data. This is standard admin infrastructure configuration, not an SSRF vector.

  • No hardcoded credentials: No API keys, tokens, secrets, or placeholder credentials in the diff.

  • resource.WithHost() includes hostname: The OTel resource includes the server hostname. This is standard OTel practice and the data goes to admin-configured infrastructure only.

  • WithInsecure() gRPC transport: Documented in the admin guide with guidance to route through an OTel Collector for TLS. Since this is an admin-only config field with an explicit help text note, this is an intentional operational tradeoff.

Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

@crspeller crspeller requested a review from nickmisasi May 5, 2026 18:00
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Security Review: OpenTelemetry Tracing

No new medium, high, or critical security vulnerabilities identified in this PR.

All CI checks pass (31/33 passed, 2 skipped/cancelled, 0 failed).

What was reviewed

This PR adds OpenTelemetry distributed tracing infrastructure to the Agents plugin, threading context.Context through the LLM call chain, adding span instrumentation at major boundaries (HTTP, LLM calls, tool execution, MCP, search, web search, streaming), and replacing the legacy EnableLLMTrace feature with structured OTel exports.

Security-relevant aspects examined

  • Span attributes contain only operational metadata: Exported attributes are limited to entity IDs (ai.user.id, ai.channel.id, ai.post.id), provider/model identifiers, operation types, token counts, tool names, and MCP server names. No post content, LLM prompts, system instructions, tool arguments, tool results, API keys, or user-generated content is included in span data.

  • Legacy EnableLLMTrace removal is a security improvement: The removed LanguageModelLogWrapper and TraceResolved/TraceUnknown methods logged full conversation data including prompts, tool arguments, and tool results to plugin logs. The OTel replacement exports only structured span metadata — this eliminates a data exposure vector.

  • Error messages are sanitized: All error paths in the plugin's own spans (bifrost.go streamChat/streamResponses) pass through SanitizeProviderError() which strips the API key before recording on spans. The Bifrost tracer adapter receives error messages from Bifrost's internal callbacks (PopulateLLMResponseAttributes, EndSpan) without explicit sanitization, but these messages originate from LLM provider API responses which do not echo API keys.

  • DetachContext correctly handles async streaming: The telemetry.DetachContext() helper preserves the OTel span while breaking the cancellation chain. This replaces the previous context.Background() usage and is properly applied at all StreamToNewDM call sites and the async processSearch goroutine.

  • Auth/permission model unchanged: All HTTP handlers continue to authenticate via the server-injected Mattermost-User-Id header. No new endpoints, data access paths, or permission checks were introduced. The context.Context plumbing is purely for trace propagation, not for carrying auth or permission state.

  • OpenTelemetryEndpoint is admin-only configuration: The endpoint field is exposed only in the System Console Debug panel to system admins. The gRPC exporter uses the OTLP protocol, sending only trace protobuf data. This is standard admin infrastructure configuration.

  • No hardcoded credentials: No API keys, tokens, secrets, or placeholder credentials in the diff.

  • WithInsecure() gRPC transport: Documented in the admin guide with guidance to route through an OTel Collector for TLS. Since this is an admin-only config field with explicit help text, this is an intentional operational tradeoff, not a vulnerability.

  • resource.WithHost() includes hostname: Standard OTel practice; data goes to admin-configured infrastructure only.

Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

Copy link
Copy Markdown
Collaborator

@nickmisasi nickmisasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking good. A couple comments and suggestions.

One question that I may have missed the answer too in code: do we have the right span ID propogation/rehydration in order to allow us to correlate sessions across nodes?

I'm thinking through an example of an invocation of an agent that calls a tool (all one node) but that tool needs acceptance, and the acceptance gets routed to a different node.

>
<ItemList>
<BooleanItem
label={intl.formatMessage({defaultMessage: 'Enable LLM Trace'})}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about removing this - not all deployments will have access to something like Tempo or Jaeger, and this might be our only debug surface to identify issues

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a console log output option for the tracing.

Comment thread telemetry/attributes.go Outdated
Comment on lines +24 to +25
BotName = attribute.Key("ai.bot.name")
BotID = attribute.Key("ai.bot.id")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should probably start using agent now since we plan to move to that

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

Comment thread streaming/streaming.go
Comment on lines +441 to +444
trace.WithAttributes(
telemetry.PostID.String(post.Id),
telemetry.ChannelID.String(post.ChannelId),
),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add the root post ID here, we'll be able to correlate spans that are created throughout a thread

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, I think there is a lot more stuff to add. Need some follow up PRs to refine this into something really useful.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
conversation/service.go (1)

570-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t use Background() for title generation calls.

Line 587 severs parent trace and cancellation by starting from stdcontext.Background() instead of the caller’s context.

Targeted fix
 func (s *Service) GenerateTitle(
+    ctx stdcontext.Context,
     conversationID string,
     lm llm.LanguageModel,
     userMessage string,
     context *llm.Context,
 ) error {
@@
-    title, err := lm.ChatCompletionNoStream(stdcontext.Background(), req,
+    title, err := lm.ChatCompletionNoStream(ctx, req,
         llm.WithMaxGeneratedTokens(25),
         llm.WithReasoningDisabled(),
         llm.WithToolsDisabled(),
     )

As per coding guidelines, "All functions in the request pipeline must accept ctx context.Context as the first parameter and propagate it from entry points ... through to LLM calls and external services for OpenTelemetry tracing."

Also applies to: 587-591

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conversation/service.go` around lines 570 - 575, The GenerateTitle method on
type Service currently severs tracing/cancellation by using
stdcontext.Background(); change its signature to accept ctx context.Context as
the first parameter (func (s *Service) GenerateTitle(ctx context.Context,
conversationID string, ...)), update all callers to pass their context, replace
any stdcontext.Background() usages inside GenerateTitle with the incoming ctx,
and ensure that ctx is propagated into all LLM/external calls (e.g., the
llm.LanguageModel call and any calls that currently receive
stdcontext.Background()) so OpenTelemetry tracing and cancellation flow
end-to-end.
🧹 Nitpick comments (2)
conversations/tool_approval.go (1)

128-142: ⚡ Quick win

Per-tool resolution spans provide good observability.

Each tool resolution gets its own span with name, ID, and status attributes. The status is correctly set before span.End() is called.

Consider also calling span.RecordError(resolveErr) and span.SetStatus(codes.Error, ...) when an error occurs, per the coding guidelines for consistent error recording. Currently only Bifrost spans record errors this way.

Proposed enhancement for error recording
 			if resolveErr != nil {
 				toolSpan.SetAttributes(telemetry.ToolStatus.String("error"))
+				toolSpan.RecordError(resolveErr)
+				toolSpan.SetStatus(codes.Error, resolveErr.Error())
 			} else {
 				toolSpan.SetAttributes(telemetry.ToolStatus.String("success"))
 			}

This requires importing "go.opentelemetry.io/otel/codes".

As per coding guidelines: "Record errors with span.RecordError(err) and span.SetStatus(codes.Error, msg)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conversations/tool_approval.go` around lines 128 - 142, The span for per-tool
resolution (started as toolCtx, toolSpan via telemetry.Tracer().Start in the
resolve tool block) currently sets only a status attribute on error; update the
error branch where resolveErr != nil to call toolSpan.RecordError(resolveErr)
and toolSpan.SetStatus(codes.Error, resolveErr.Error()) before toolSpan.End(),
and add the necessary import for "go.opentelemetry.io/otel/codes"; leave the
success branch unchanged.
telemetry/turn_id.go (1)

103-121: 💤 Low value

Consider logging or bounding the retry loop for rand.Read failures.

The infinite loops assume rand.Read will eventually produce valid (non-zero) IDs. While cryptographically this is near-certain, silently discarding the error from rand.Read means a broken entropy source would cause an infinite loop with no diagnostic output. This is extremely unlikely in practice, so this is a minor defensive suggestion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@telemetry/turn_id.go` around lines 103 - 121, The NewIDs method on
randomIDGenerator should stop silently looping on rand.Read failures: check the
error returned by rand.Read when filling trace.TraceID and trace.SpanID, add a
bounded retry (e.g. a maxAttempts constant) for each ID generation loop, and on
repeated failures log the error and return a safe fallback (e.g. zero IDs or
propagate a higher-level error depending on surrounding code) instead of looping
forever; update the loops in randomIDGenerator.NewIDs to increment attempts,
handle non-nil err from rand.Read, call your logger (or panic if appropriate for
your app) after exhausting attempts, and then return the fallback IDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conversation/service.go`:
- Around line 152-153: The method Service.GetInitiatingUserTurn (and the other
new request-path method around lines 180-181) must accept ctx context.Context as
the first parameter and propagate it to downstream calls; update the function
signatures to func (s *Service) GetInitiatingUserTurn(ctx context.Context,
conversationID, postID string) and any sibling method similarly, and pass ctx
into s.store.GetTurnsForConversation (and other store/handler calls) as well as
update all call sites to forward their ctx so OpenTelemetry trace context is
preserved.

---

Outside diff comments:
In `@conversation/service.go`:
- Around line 570-575: The GenerateTitle method on type Service currently severs
tracing/cancellation by using stdcontext.Background(); change its signature to
accept ctx context.Context as the first parameter (func (s *Service)
GenerateTitle(ctx context.Context, conversationID string, ...)), update all
callers to pass their context, replace any stdcontext.Background() usages inside
GenerateTitle with the incoming ctx, and ensure that ctx is propagated into all
LLM/external calls (e.g., the llm.LanguageModel call and any calls that
currently receive stdcontext.Background()) so OpenTelemetry tracing and
cancellation flow end-to-end.

---

Nitpick comments:
In `@conversations/tool_approval.go`:
- Around line 128-142: The span for per-tool resolution (started as toolCtx,
toolSpan via telemetry.Tracer().Start in the resolve tool block) currently sets
only a status attribute on error; update the error branch where resolveErr !=
nil to call toolSpan.RecordError(resolveErr) and toolSpan.SetStatus(codes.Error,
resolveErr.Error()) before toolSpan.End(), and add the necessary import for
"go.opentelemetry.io/otel/codes"; leave the success branch unchanged.

In `@telemetry/turn_id.go`:
- Around line 103-121: The NewIDs method on randomIDGenerator should stop
silently looping on rand.Read failures: check the error returned by rand.Read
when filling trace.TraceID and trace.SpanID, add a bounded retry (e.g. a
maxAttempts constant) for each ID generation loop, and on repeated failures log
the error and return a safe fallback (e.g. zero IDs or propagate a higher-level
error depending on surrounding code) instead of looping forever; update the
loops in randomIDGenerator.NewIDs to increment attempts, handle non-nil err from
rand.Read, call your logger (or panic if appropriate for your app) after
exhausting attempts, and then return the fallback IDs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 01d3ed62-887f-4241-a4c0-6682377a84c9

📥 Commits

Reviewing files that changed from the base of the PR and between fb445a9 and 091913e.

📒 Files selected for processing (9)
  • conversation/service.go
  • conversation/service_test.go
  • conversations/conversations.go
  • conversations/handle_messages.go
  • conversations/regeneration.go
  • conversations/tool_approval.go
  • telemetry/telemetry.go
  • telemetry/turn_id.go
  • telemetry/turn_id_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • conversation/service_test.go

Comment thread conversation/service.go
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Security Review — OpenTelemetry Tracing

Reviewed all 87 changed files (2,754 additions / 788 deletions) for vulnerabilities matching the security checklist: channel data isolation, identity spoofing, inter-plugin bridge trust, tool/MCP boundaries, prompt injection, file access control, search/embedding scoping, secrets exposure, and standard injection/auth bypass risks.

Summary

No medium, high, or critical security findings. This PR is primarily infrastructure plumbing — adding context.Context as a first parameter across the request pipeline and creating OpenTelemetry spans at system boundaries. The changes do not alter authentication, authorization, data access control, or permission-checking logic.

What was verified

  • No sensitive data in spans: Span attributes are limited to internal IDs (ai.user.id, ai.channel.id, ai.post.id), operational metadata (provider name, model, token counts, tool name/status), and error messages that are already sanitized via SanitizeProviderError. Post content, message bodies, prompts, tool arguments, and tool results are not recorded in any span.
  • Auth/permission logic unchanged: All userID values continue to originate from the server-injected Mattermost session header via c.Request.Context(). No new endpoints or data-access paths were introduced. The context.Context additions are purely for cancellation/tracing propagation.
  • Removed logging surface is a net improvement: The old LanguageModelLogWrapper (behind EnableLLMTrace) logged full conversation prompts, tool arguments, and tool results to the plugin log. Its removal eliminates a data-exposure surface. The OTel replacement records only metadata, not content.
  • OTLP endpoint is admin-only: The OpenTelemetryEndpoint config field is behind the System Console admin interface. The WithInsecure() gRPC option is documented in the admin guide with guidance to use a TLS-terminating collector for production. This is an acceptable admin-configured integration pattern.
  • No new secrets or credentials: No hardcoded keys, tokens, or credentials in the diff. API keys continue to be sanitized from error messages before recording on spans.
Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Security Review — No Findings

Reviewed the full diff (91 files, ~2,750 additions) for the OpenTelemetry tracing infrastructure PR.

What was checked:

  • Span attributes for sensitive data leakage (API keys, message content, file data, credentials)
  • Auth/permission changes across all modified call paths
  • SSRF potential from the configurable OTLP endpoint
  • Data isolation impact of user/channel/post IDs in traces
  • Impact of removed logging code (LanguageModelLogWrapper, TraceResolved/TraceUnknown)
  • Context propagation changes (DetachContext, WithTurnID) for cancellation or identity issues

Assessment: No medium, high, or critical security vulnerabilities identified.

  • Span attributes are limited to metadata (IDs, tool names, token counts, provider/model). No message bodies, prompts, file contents, or secrets flow into traces.
  • Error messages pass through SanitizeProviderError before recording on spans, preserving API key scrubbing.
  • The removed LanguageModelLogWrapper and TraceResolved/TraceUnknown previously logged full prompt text and tool arguments — their removal is a net security improvement.
  • No auth checks are added, removed, or weakened. Context threading is purely mechanical.
  • The OTLP endpoint is admin-only configuration behind system-admin permission checks.
Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

@crspeller
Copy link
Copy Markdown
Member Author

One question that I may have missed the answer too in code: do we have the right span ID propogation/rehydration in order to allow us to correlate sessions across nodes?

@nickmisasi We did not, but we do now.

@crspeller crspeller requested a review from nickmisasi May 6, 2026 17:18
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Security Review — OpenTelemetry Tracing PR

Reviewed the full diff (93 files, +3133/−790) for security concerns relevant to a multi-user Mattermost plugin environment.

Summary

This PR is primarily observability infrastructure — adding context.Context threading and OpenTelemetry span instrumentation. It does not introduce new data-access paths, permission bypasses, or user-controlled inputs that reach sensitive sinks. The authorization model is unchanged; the config endpoint remains admin-gated (mattermostAdminAuthorizationRequired). The otelgin middleware does not capture HTTP headers by default, so the injected Mattermost-User-ID header is not leaked into spans.

Positive security note: The removal of LanguageModelLogWrapper (which previously logged full conversation prompts via fmt.Sprintf(" %v", request)) is a net improvement — it eliminates a path where user message content was written to server logs.

One medium-severity finding identified below regarding unsanitized provider error messages flowing into trace data.

Open in Web View Automation 

Sent by Cursor Automation: Find vulnerabilities

Comment thread bifrost/tracer.go
Copy link
Copy Markdown
Collaborator

@nickmisasi nickmisasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

crspeller added 2 commits May 7, 2026 13:10
…y-tracing-138e

# Conflicts:
#	api/api_llm_bridge.go
#	api/fake_llm_test.go
@dryrunsecurity
Copy link
Copy Markdown

dryrunsecurity Bot commented May 7, 2026

DryRun Security

This pull request introduces a critical LLM prompt injection risk in api/api_channel.go by inserting the user-provided Prompt field directly into the system prompt template, where a malicious user could break out of the quoted instruction block and override system instructions. Although the prompt is wrapped as an additional instruction, it is still vulnerable to bypassing constraints like tool limits or required citation formats.

🔴 LLM Prompt Injection in api/api_channel.go (drs_5632491e)
Vulnerability LLM Prompt Injection
Description The user-provided 'Prompt' field from the analysis request is injected directly into the system prompt template (summarize_channel_system.tmpl). While there is an attempt to isolate it by wrapping it in quotes and adding it as an 'Additional instruction', this structure is vulnerable to common prompt injection techniques where a malicious user can escape the quoted instruction block and override the system-defined instructions (e.g., bypassing tool limitations or mandated citation formats).

result, err := analyzer.AnalyzeChannel(c.Request.Context(), llmContext, channel.Id, userID, bot.GetMMBot().UserId, analysisData)
if err != nil {
c.AbortWithError(http.StatusInternalServerError, fmt.Errorf("failed to analyze channel: %w", err))
return


Comment to provide feedback on these findings.

Report false positive: @dryrunsecurity fp [FINDING ID] [FEEDBACK]
Report low-impact: @dryrunsecurity nit [FINDING ID] [FEEDBACK]

Example: @dryrunsecurity fp drs_90eda195 This code is not user-facing

All finding details can be found in the DryRun Security Dashboard.

@crspeller
Copy link
Copy Markdown
Member Author

@DryRunSecurity fp drs_5632491e This is by design. We don't care about the user overriding instructions it's their server and their LLM.

@dryrunsecurity
Copy link
Copy Markdown

dryrunsecurity Bot commented May 7, 2026

Finding 5632491e has been successfully dismissed and marked as False Positive.

@crspeller crspeller merged commit 5ab61bf into master May 7, 2026
57 of 58 checks passed
@crspeller crspeller deleted the cursor/opentelemetry-tracing-138e branch May 7, 2026 13:51
Comment thread bifrost/tracer.go
h.span.SetStatus(otelcodes.Error, bErr.Error.Message)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Unsanitized provider error message recorded on Bifrost child span

bErr.Error.Message is written directly to the span status without passing through llm.SanitizeProviderError. LLM provider error responses can contain API key material (e.g., OpenAI: "Incorrect API key provided: sk-proj-abc...xyz").

This value surfaces in two ways:

  1. Logs modelogSpanProcessor.OnEnd outputs status.Description as the "error" log field (log_processor.go:58), making it visible in server logs.
  2. OTLP mode — the span is exported to the configured collector.

The parent-level code in bifrost.go (lines 549, 604, 1547, 1614) correctly applies llm.SanitizeProviderError(err, b.apiKey) before recording errors on its span. This inconsistency means the Bifrost child span leaks what the parent span sanitizes.

Impact: If a provider echoes back API key material in an error response, that key would appear in trace logs or OTLP export, accessible to anyone with server-log or trace-backend access.

Suggested fix: Inject the API key into the otelTracer at construction time (it's available in newBifrostClient) and sanitize bErr.Error.Message here before setting the span status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants