MCP: make Elasticsearch failures visible as failing APM transactions#3466
MCP: make Elasticsearch failures visible as failing APM transactions#3466reakaleek wants to merge 3 commits into
Conversation
- MarkFailure now calls AddException and propagates ActivityStatusCode.Error
to the root server span so APM shows the HTTP transaction as failed (not 200 OK)
- SearchTools returns CallToolResult { IsError = true } on exception so the
MCP client/model knows the tool actually failed
- ElasticsearchHealthCheck wires the existing CanConnect ping into /health
(returns 503 when ES is unreachable) without touching /alive (liveness)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 31 minutes and 30 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR implements three interconnected improvements to the MCP remote service: adds an Elasticsearch readiness health check on the Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs (1)
59-80:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winWalk the passed tool activity’s parent chain when marking the server span failed.
MarkFailure(activity, ex)forwardsactivity, butFailServerSpancurrently traversesActivity.Current, so server-span attribution can drift if the ambient current activity differs at catch-time.Suggested fix
public static void MarkFailure(Activity? activity, Exception ex) { _ = activity?.SetTag("mcp.call.success", false); _ = activity?.SetTag("mcp.call.error_type", ex.GetType().FullName); _ = activity?.SetTag("error.message", ex.Message); _ = activity?.AddException(ex); _ = activity?.SetStatus(ActivityStatusCode.Error, ex.Message); - FailServerSpan(ex); + FailServerSpan(activity, ex); } - private static void FailServerSpan(Exception ex) + private static void FailServerSpan(Activity? activity, Exception ex) { - for (var a = Activity.Current; a is not null; a = a.Parent) + for (var a = activity?.Parent; a is not null; a = a.Parent) { if (a.Kind != ActivityKind.Server) continue; _ = a.SetStatus(ActivityStatusCode.Error, ex.Message); _ = a.AddException(ex); return; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs` around lines 59 - 80, The failure-marking currently walks Activity.Current in FailServerSpan(Exception ex), which can misattribute server spans; change FailServerSpan to accept the activity that failed (e.g., FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to find the ActivityKind.Server span, then SetStatus/AddException on that span; update MarkFailure(Activity? activity, Exception ex) to call the new FailServerSpan(activity, ex) and preserve the existing null-safe behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs`:
- Around line 28-30: The health check currently converts any exception into
HealthCheckResult.Unhealthy in ElasticsearchHealthCheck.CheckHealthAsync
(wrapping accessor.CanConnect which awaits Client.PingAsync), which incorrectly
swallows caller cancellation; change the catch so OperationCanceledException
(and any exception with IsCancellationRequested on the passed token) is not
converted to Unhealthy but is rethrown (or allowed to propagate) while other
exceptions still return HealthCheckResult.Unhealthy("Elasticsearch unreachable",
ex); also add a unit test that cancels the caller CancellationToken before
accessor.CanConnect completes and asserts CheckHealthAsync throws
OperationCanceledException instead of returning an Unhealthy result.
In
`@tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs`:
- Around line 28-32: Before casting and parsing the returned payload, assert
that the operation succeeded by checking result.IsError is false; in the test
methods calling searchTools.SemanticSearch (and analogous calls at the other
locations) add an assertion like Assert.False(result.IsError, $"SemanticSearch
failed: {result.Error}") immediately after awaiting the call and before
accessing result.Content or casting to TextContentBlock so failures surface
clearly instead of causing downstream JSON/DTO parsing errors.
---
Outside diff comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs`:
- Around line 59-80: The failure-marking currently walks Activity.Current in
FailServerSpan(Exception ex), which can misattribute server spans; change
FailServerSpan to accept the activity that failed (e.g.,
FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed
activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to
find the ActivityKind.Server span, then SetStatus/AddException on that span;
update MarkFailure(Activity? activity, Exception ex) to call the new
FailServerSpan(activity, ex) and preserve the existing null-safe behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2e9e0117-918b-451d-aeb0-43ba7942653a
📒 Files selected for processing (8)
src/api/Elastic.Documentation.Mcp.Remote/Elastic.Documentation.Mcp.Remote.csprojsrc/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cssrc/api/Elastic.Documentation.Mcp.Remote/Program.cssrc/api/Elastic.Documentation.Mcp.Remote/Responses/McpToolResults.cssrc/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cssrc/api/Elastic.Documentation.Mcp.Remote/Tools/SearchTools.cstests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cstests/Mcp.Remote.Tests/McpToolTelemetryTests.cs
…h check, assert IsError in integration tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why
During an Elasticsearch outage the MCP search tools were silently failing — APM showed HTTP 200 everywhere,
/healthreturned healthy, and MCP clients received what looked like a successful (but empty) result. The failure was invisible to both monitoring and the calling model.What
Three gaps closed:
Transaction span propagation —
MarkFailurenow callsactivity.AddException(ex)and walks up to the ASP.NET Core server span to setActivityStatusCode.Errorthere too. The HTTP transaction now appears as a failing transaction in APM (and populates the Errors view) even though the HTTP response stays 200, which is correct per MCP/JSON-RPC spec.isErrortool results —SemanticSearchandFindRelatedDocsnow returnCallToolResult { IsError = true }on exception instead of a 200 success body containing an{"error": "..."}field. The MCP client and model now know the tool actually failed.Elasticsearch readiness check —
ElasticsearchHealthCheckwires the existingElasticsearchClientAccessor.CanConnectping into/healthso it returns 503 during an ES outage./alive(liveness) is intentionally excluded to avoid pod crash-loops.The
DefaultSearchServicein the underlying package throws with "Elasticsearch response is not valid" when ES returns an error response, so exception-based failure detection covers all ES failure modes (unreachable, error response, query rejection).Test plan
dotnet test tests/Mcp.Remote.Tests— 17 tests pass including 2 new telemetry tests verifying the server span propagationcurl /healthreturns 503,curl /alivereturns 200search_*via MCP → tool result hasisError: true, HTTP response is 200/mcpappears as a failed transaction