MCP: make Elasticsearch failures visible as failing APM transactions by reakaleek · Pull Request #3466 · elastic/docs-builder

reakaleek · 2026-06-03T20:04:41Z

Why

During an Elasticsearch outage the MCP search tools were silently failing — APM showed HTTP 200 everywhere, /health returned healthy, and MCP clients received what looked like a successful (but empty) result. The failure was invisible to both monitoring and the calling model.

What

Three gaps closed:

Transaction span propagation — MarkFailure now calls activity.AddException(ex) and walks up to the ASP.NET Core server span to set ActivityStatusCode.Error there too. The HTTP transaction now appears as a failing transaction in APM (and populates the Errors view) even though the HTTP response stays 200, which is correct per MCP/JSON-RPC spec.

isError tool results — SemanticSearch and FindRelatedDocs now return CallToolResult { IsError = true } on exception instead of a 200 success body containing an {"error": "..."} field. The MCP client and model now know the tool actually failed.

Elasticsearch readiness check — ElasticsearchHealthCheck wires the existing ElasticsearchClientAccessor.CanConnect ping into /health so it returns 503 during an ES outage. /alive (liveness) is intentionally excluded to avoid pod crash-loops.

The DefaultSearchService in the underlying package throws with "Elasticsearch response is not valid" when ES returns an error response, so exception-based failure detection covers all ES failure modes (unreachable, error response, query rejection).

Test plan

dotnet test tests/Mcp.Remote.Tests — 17 tests pass including 2 new telemetry tests verifying the server span propagation
Point server at a bad ES endpoint → curl /health returns 503, curl /alive returns 200
Call search_* via MCP → tool result has isError: true, HTTP response is 200
With OTel/APM wired: the POST transaction to /mcp appears as a failed transaction

- MarkFailure now calls AddException and propagates ActivityStatusCode.Error to the root server span so APM shows the HTTP transaction as failed (not 200 OK) - SearchTools returns CallToolResult { IsError = true } on exception so the MCP client/model knows the tool actually failed - ElasticsearchHealthCheck wires the existing CanConnect ping into /health (returns 503 when ES is unreachable) without touching /alive (liveness) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-03T20:17:15Z

Warning

Review limit reached

@reakaleek, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 30 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e493262f-dc4e-478e-bd81-b0eb0e1fa47f

📥 Commits

Reviewing files that changed from the base of the PR and between 39fa9cc and d1d452d.

📒 Files selected for processing (2)

src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs
tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs

📝 Walkthrough

Walkthrough

This PR implements three interconnected improvements to the MCP remote service: adds an Elasticsearch readiness health check on the /health endpoint, standardizes tool response formats via a CallToolResult factory, and enhances exception telemetry to propagate errors to parent server spans. The project configuration is updated to expose internal types to the test assembly, and integration/unit tests are expanded to verify the new behavior.

Possibly related PRs

elastic/docs-builder#3356: The main PR's MCP health wiring (adding an Elasticsearch readiness check on /health while excluding it from /alive) directly supports the retrieved PR's new MCP smoke test that asserts GET /docs/_mcp/alive returns 200.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main objective: making Elasticsearch failures visible as failing APM transactions, which is the core theme across all changes.
Description check	✅ Passed	The description clearly explains the problem, three specific solutions, and verification steps, directly corresponding to the changeset's modifications.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch lydian-iguana

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs (1)

59-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Walk the passed tool activity’s parent chain when marking the server span failed.

MarkFailure(activity, ex) forwards activity, but FailServerSpan currently traverses Activity.Current, so server-span attribution can drift if the ambient current activity differs at catch-time.

Suggested fix

 	public static void MarkFailure(Activity? activity, Exception ex)
 	{
 		_ = activity?.SetTag("mcp.call.success", false);
 		_ = activity?.SetTag("mcp.call.error_type", ex.GetType().FullName);
 		_ = activity?.SetTag("error.message", ex.Message);
 		_ = activity?.AddException(ex);
 		_ = activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
-		FailServerSpan(ex);
+		FailServerSpan(activity, ex);
 	}
 
-	private static void FailServerSpan(Exception ex)
+	private static void FailServerSpan(Activity? activity, Exception ex)
 	{
-		for (var a = Activity.Current; a is not null; a = a.Parent)
+		for (var a = activity?.Parent; a is not null; a = a.Parent)
 		{
 			if (a.Kind != ActivityKind.Server)
 				continue;
 			_ = a.SetStatus(ActivityStatusCode.Error, ex.Message);
 			_ = a.AddException(ex);
 			return;
 		}
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs`
around lines 59 - 80, The failure-marking currently walks Activity.Current in
FailServerSpan(Exception ex), which can misattribute server spans; change
FailServerSpan to accept the activity that failed (e.g.,
FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed
activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to
find the ActivityKind.Server span, then SetStatus/AddException on that span;
update MarkFailure(Activity? activity, Exception ex) to call the new
FailServerSpan(activity, ex) and preserve the existing null-safe behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs`:
- Around line 28-30: The health check currently converts any exception into
HealthCheckResult.Unhealthy in ElasticsearchHealthCheck.CheckHealthAsync
(wrapping accessor.CanConnect which awaits Client.PingAsync), which incorrectly
swallows caller cancellation; change the catch so OperationCanceledException
(and any exception with IsCancellationRequested on the passed token) is not
converted to Unhealthy but is rethrown (or allowed to propagate) while other
exceptions still return HealthCheckResult.Unhealthy("Elasticsearch unreachable",
ex); also add a unit test that cancels the caller CancellationToken before
accessor.CanConnect completes and asserts CheckHealthAsync throws
OperationCanceledException instead of returning an Unhealthy result.

In
`@tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs`:
- Around line 28-32: Before casting and parsing the returned payload, assert
that the operation succeeded by checking result.IsError is false; in the test
methods calling searchTools.SemanticSearch (and analogous calls at the other
locations) add an assertion like Assert.False(result.IsError, $"SemanticSearch
failed: {result.Error}") immediately after awaiting the call and before
accessing result.Content or casting to TextContentBlock so failures surface
clearly instead of causing downstream JSON/DTO parsing errors.

---

Outside diff comments:
In `@src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs`:
- Around line 59-80: The failure-marking currently walks Activity.Current in
FailServerSpan(Exception ex), which can misattribute server spans; change
FailServerSpan to accept the activity that failed (e.g.,
FailServerSpan(Activity? activity, Exception ex)) and iterate up the passed
activity's parent chain (for (var a = activity; a is not null; a = a.Parent)) to
find the ActivityKind.Server span, then SetStatus/AddException on that span;
update MarkFailure(Activity? activity, Exception ex) to call the new
FailServerSpan(activity, ex) and preserve the existing null-safe behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2e9e0117-918b-451d-aeb0-43ba7942653a

📥 Commits

Reviewing files that changed from the base of the PR and between 711750f and 39fa9cc.

📒 Files selected for processing (8)

src/api/Elastic.Documentation.Mcp.Remote/Elastic.Documentation.Mcp.Remote.csproj
src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs
src/api/Elastic.Documentation.Mcp.Remote/Program.cs
src/api/Elastic.Documentation.Mcp.Remote/Responses/McpToolResults.cs
src/api/Elastic.Documentation.Mcp.Remote/Telemetry/McpToolTelemetry.cs
src/api/Elastic.Documentation.Mcp.Remote/Tools/SearchTools.cs
tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs
tests/Mcp.Remote.Tests/McpToolTelemetryTests.cs

…h check, assert IsError in integration tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

reakaleek and others added 2 commits June 3, 2026 21:56

MCP: fix integration tests broken by CallToolResult return type change

39fa9cc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

reakaleek requested a review from a team as a code owner June 3, 2026 20:04

reakaleek requested a review from Mpdreamz June 3, 2026 20:04

reakaleek added the enhancement label Jun 3, 2026

reakaleek had a problem deploying to integration-tests June 3, 2026 20:04 — with GitHub Actions Failure

github-actions Bot deployed to assembler-preview June 3, 2026 20:05 View deployment

coderabbitai Bot requested changes Jun 3, 2026

View reviewed changes

Comment thread src/api/Elastic.Documentation.Mcp.Remote/Health/ElasticsearchHealthCheck.cs

Comment thread tests-integration/Mcp.Remote.IntegrationTests/SearchToolsIntegrationTests.cs

MCP: address coderabbit review — rethrow caller cancellation in healt…

d1d452d

…h check, assert IsError in integration tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

reakaleek had a problem deploying to integration-tests June 3, 2026 20:33 — with GitHub Actions Failure

github-actions Bot deployed to assembler-preview June 3, 2026 20:33 View deployment

reakaleek had a problem deploying to integration-tests June 3, 2026 22:37 — with GitHub Actions Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MCP: make Elasticsearch failures visible as failing APM transactions#3466

MCP: make Elasticsearch failures visible as failing APM transactions#3466
reakaleek wants to merge 3 commits into
mainfrom
lydian-iguana

reakaleek commented Jun 3, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Review limit reached

Walkthrough

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

reakaleek commented Jun 3, 2026

Why

What

Test plan

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading