Skip to content

MCP: make Elasticsearch failures visible as failing APM transactions#3466

Open
reakaleek wants to merge 3 commits into
mainfrom
lydian-iguana
Open

MCP: make Elasticsearch failures visible as failing APM transactions#3466
reakaleek wants to merge 3 commits into
mainfrom
lydian-iguana

Conversation

@reakaleek
Copy link
Copy Markdown
Member

Why

During an Elasticsearch outage the MCP search tools were silently failing — APM showed HTTP 200 everywhere, /health returned healthy, and MCP clients received what looked like a successful (but empty) result. The failure was invisible to both monitoring and the calling model.

What

Three gaps closed:

Transaction span propagationMarkFailure now calls activity.AddException(ex) and walks up to the ASP.NET Core server span to set ActivityStatusCode.Error there too. The HTTP transaction now appears as a failing transaction in APM (and populates the Errors view) even though the HTTP response stays 200, which is correct per MCP/JSON-RPC spec.

isError tool resultsSemanticSearch and FindRelatedDocs now return CallToolResult { IsError = true } on exception instead of a 200 success body containing an {"error": "..."} field. The MCP client and model now know the tool actually failed.

Elasticsearch readiness checkElasticsearchHealthCheck wires the existing ElasticsearchClientAccessor.CanConnect ping into /health so it returns 503 during an ES outage. /alive (liveness) is intentionally excluded to avoid pod crash-loops.

The DefaultSearchService in the underlying package throws with "Elasticsearch response is not valid" when ES returns an error response, so exception-based failure detection covers all ES failure modes (unreachable, error response, query rejection).

Test plan

  • dotnet test tests/Mcp.Remote.Tests — 17 tests pass including 2 new telemetry tests verifying the server span propagation
  • Point server at a bad ES endpoint → curl /health returns 503, curl /alive returns 200
  • Call search_* via MCP → tool result has isError: true, HTTP response is 200
  • With OTel/APM wired: the POST transaction to /mcp appears as a failed transaction

reakaleek and others added 2 commits June 3, 2026 21:56
- MarkFailure now calls AddException and propagates ActivityStatusCode.Error
  to the root server span so APM shows the HTTP transaction as failed (not 200 OK)
- SearchTools returns CallToolResult { IsError = true } on exception so the
  MCP client/model knows the tool actually failed
- ElasticsearchHealthCheck wires the existing CanConnect ping into /health
  (returns 503 when ES is unreachable) without touching /alive (liveness)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Warning

Review limit reached

@reakaleek, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 30 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e493262f-dc4e-478e-bd81-b0eb0e1fa47f

📥 Commits

Reviewing files that changed from the base of the PR and between 39fa9cc and d1d452d.

📒 Files selected for processing (2)
  • src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs
  • tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs
📝 Walkthrough

Walkthrough

This PR implements three interconnected improvements to the MCP remote service: adds an Elasticsearch readiness health check on the /health endpoint, standardizes tool response formats via a CallToolResult factory, and enhances exception telemetry to propagate errors to parent server spans. The project configuration is updated to expose internal types to the test assembly, and integration/unit tests are expanded to verify the new behavior.

Possibly related PRs

  • elastic/docs-builder#3356: The main PR's MCP health wiring (adding an Elasticsearch readiness check on /health while excluding it from /alive) directly supports the retrieved PR's new MCP smoke test that asserts GET /docs/_mcp/alive returns 200.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main objective: making Elasticsearch failures visible as failing APM transactions, which is the core theme across all changes.
Description check ✅ Passed The description clearly explains the problem, three specific solutions, and verification steps, directly corresponding to the changeset's modifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch lydian-iguana

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs (1)

59-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Walk the passed tool activity’s parent chain when marking the server span failed.

MarkFailure(activity, ex) forwards activity, but FailServerSpan currently traverses Activity.Current, so server-span attribution can drift if the ambient current activity differs at catch-time.

Suggested fix
 	public static void MarkFailure(Activity? activity, Exception ex)
 	{
 		_ = activity?.SetTag("mcp.call.success", false);
 		_ = activity?.SetTag("mcp.call.error_type", ex.GetType().FullName);
 		_ = activity?.SetTag("error.message", ex.Message);
 		_ = activity?.AddException(ex);
 		_ = activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
-		FailServerSpan(ex);
+		FailServerSpan(activity, ex);
 	}
 
-	private static void FailServerSpan(Exception ex)
+	private static void FailServerSpan(Activity? activity, Exception ex)
 	{
-		for (var a = Activity.Current; a is not null; a = a.Parent)
+		for (var a = activity?.Parent; a is not null; a = a.Parent)
 		{
 			if (a.Kind != ActivityKind.Server)
 				continue;
 			_ = a.SetStatus(ActivityStatusCode.Error, ex.Message);
 			_ = a.AddException(ex);
 			return;
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs`
around lines 59 - 80, The failure-marking currently walks Activity.Current in
FailServerSpan(Exception ex), which can misattribute server spans; change
FailServerSpan to accept the activity that failed (e.g.,
FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed
activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to
find the ActivityKind.Server span, then SetStatus/AddException on that span;
update MarkFailure(Activity? activity, Exception ex) to call the new
FailServerSpan(activity, ex) and preserve the existing null-safe behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs`:
- Around line 28-30: The health check currently converts any exception into
HealthCheckResult.Unhealthy in ElasticsearchHealthCheck.CheckHealthAsync
(wrapping accessor.CanConnect which awaits Client.PingAsync), which incorrectly
swallows caller cancellation; change the catch so OperationCanceledException
(and any exception with IsCancellationRequested on the passed token) is not
converted to Unhealthy but is rethrown (or allowed to propagate) while other
exceptions still return HealthCheckResult.Unhealthy("Elasticsearch unreachable",
ex); also add a unit test that cancels the caller CancellationToken before
accessor.CanConnect completes and asserts CheckHealthAsync throws
OperationCanceledException instead of returning an Unhealthy result.

In
`@tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs`:
- Around line 28-32: Before casting and parsing the returned payload, assert
that the operation succeeded by checking result.IsError is false; in the test
methods calling searchTools.SemanticSearch (and analogous calls at the other
locations) add an assertion like Assert.False(result.IsError, $"SemanticSearch
failed: {result.Error}") immediately after awaiting the call and before
accessing result.Content or casting to TextContentBlock so failures surface
clearly instead of causing downstream JSON/DTO parsing errors.

---

Outside diff comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs`:
- Around line 59-80: The failure-marking currently walks Activity.Current in
FailServerSpan(Exception ex), which can misattribute server spans; change
FailServerSpan to accept the activity that failed (e.g.,
FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed
activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to
find the ActivityKind.Server span, then SetStatus/AddException on that span;
update MarkFailure(Activity? activity, Exception ex) to call the new
FailServerSpan(activity, ex) and preserve the existing null-safe behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2e9e0117-918b-451d-aeb0-43ba7942653a

📥 Commits

Reviewing files that changed from the base of the PR and between 711750f and 39fa9cc.

📒 Files selected for processing (8)
  • src/api/Elastic.Documentation.Mcp.Remote/Elastic.Documentation.Mcp.Remote.csproj
  • src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs
  • src/api/Elastic.Documentation.Mcp.Remote/Program.cs
  • src/api/Elastic.Documentation.Mcp.Remote/Responses/McpToolResults.cs
  • src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs
  • src/api/Elastic.Documentation.Mcp.Remote/Tools/SearchTools.cs
  • tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs
  • tests/Mcp.Remote.Tests/McpToolTelemetryTests.cs

…h check, assert IsError in integration tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant