feat: Add MCP connector and MCP server metrics#4803
Conversation
PR Reviewer Guide 🔍(Review updated until commit 625b272)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 625b272 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit cc3d6dd
Suggestions up to commit 3fed456
Suggestions up to commit 32ceee9
Suggestions up to commit 463d8ff
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4803 +/- ##
============================================
+ Coverage 77.42% 77.45% +0.03%
- Complexity 11907 11933 +26
============================================
Files 963 967 +4
Lines 53326 53450 +124
Branches 6503 6507 +4
============================================
+ Hits 41285 41401 +116
- Misses 9289 9298 +9
+ Partials 2752 2751 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Persistent review updated to latest commit 32ceee9 |
Adds OTel-backed metrics for MCP connectors (client-side) and the ml-commons MCP server (server-side), following the pattern established in opensearch-project#4221. New metrics: - ml.commons.MCP_CONNECTOR_COUNT (adoption, tag: protocol) - ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT/_LATENCY (operational, tags: protocol, status) - ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNT (adoption, tag: tool_type) - ml.commons.MCP_SERVER_TOOL_CALL_COUNT/_LATENCY (operational, tags: tool_type, status) - ml.commons.MCP_SERVER_REQUEST_COUNT (operational, tags: method, status) Adoption metrics are emitted by MLStatsJobProcessor via periodic index scans; operational metrics are emitted at call sites. "Domains using MCP" is derived upstream from the auto-injected cluster_name tag. Moves AbstractMLMetricsCounter and MetricType from plugin to common so ml-algorithms can reference the counter singletons. Also removes the isMetricCollectionEnabled() gate around counter initialization in MachineLearningPlugin. Counter emission is already gated inside the abstract base, so initializing eagerly makes getInstance() safe at all call sites and fixes a latent IllegalStateException when metric collection is disabled. Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
- Add testMetricsEmittedOnSuccessAndFailure to McpStreamableHttpToolTests (mirrors the SSE test; both tools have independent recordInvocation paths) - Expand MachineLearningPlugin init comment with the full repro of the IllegalStateException bug that motivated eager counter initialization - Add MLMcpMetricsCountersTests in common/src/test covering the MCP counter singleton scaffolding (JaCoCo reports per-module; call-site tests in other modules don't cover the new counter classes in common) - Drop redundant tearDown reset in MLStatsJobProcessorTests (setUp reset is sufficient and matches the pattern used by MLAdoptionMetricsCounterTests) Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
- Histogram tags: pass the same protocol+status / tool_type+status Tags object to recordHistogram so latency can be sliced by success vs failure (McpSseTool, McpStreamableHttpTool, McpToolsHelper). - TransportMcpServerAction.extractMethod: collapse anything outside the server-served allowlist (initialize, notifications/initialized, ping, tools/list, tools/call, logging/setLevel) to "other" so untrusted client input cannot inflate the method-tag cardinality. - MLStatsJobProcessor: log a warning when a metrics scan returns more hits than BATCH_SIZE so operators notice silent under-counting until pagination is added. Drop the unreachable "unknown" protocol fallback in the connector scan (search query already filters on protocol). - MachineLearningPlugin: hoist clusterName local across the four counter initialize() calls. - Tests: assert the histogram receives the status tag in the SSE/StreamableHttp tool tests; add a TransportMcpServerAction test proving an off-allowlist method is tagged as "other". Signed-off-by: rithinpullela <rithinp@amazon.com>
32ceee9 to
3fed456
Compare
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 3fed456.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
|
Persistent review updated to latest commit 3fed456 |
Signed-off-by: rithinpullela <rithinp@amazon.com>
|
Persistent review updated to latest commit cc3d6dd |
Extract recordInvocation duplication between McpSseTool and McpStreamableHttpTool into a package-private McpToolMetrics helper. Strengthen MLMcpMetricsCountersTests to mirror the MLAdoptionMetricsCounterTests / MLOperationalMetricsCounterTests pattern (registry creation counts, increment/record routing, feature-flag toggle). Signed-off-by: rithinpullela <rithinp@amazon.com>
|
Persistent review updated to latest commit 625b272 |
Description
Adds OTel-backed metrics for MCP connectors (client-side — used by agents to call external MCP servers) and the ml-commons MCP server (server-side — exposes ml-commons tools over MCP). Follows the metrics pattern established in #4221 (
MLAdoptionMetricsCounter/MLOperationalMetricsCounter).New metrics
Adoption (periodic scan in
MLStatsJobProcessor):ml.commons.MCP_CONNECTOR_COUNTprotocol(mcp_sse/mcp_streamable_http)ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNTtool_typeOperational (emitted at call sites):
ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT/_LATENCYprotocol,statusml.commons.MCP_SERVER_TOOL_CALL_COUNT/_LATENCYtool_type,statusml.commons.MCP_SERVER_REQUEST_COUNTmethod,statuscluster_nameis auto-injected byAbstractMLMetricsCounter, so "# domains using MCP connector/server" is derived upstream fromCOUNT(DISTINCT cluster_name)— same pattern as the existingMODEL_COUNT/AGENT_COUNTmetrics.The
methodtag onMCP_SERVER_REQUEST_COUNTcaptures the JSON-RPC method (initialize,tools/list,tools/call,notifications/initialized, orunknownon parse failure). Cardinality is bounded by the MCP spec's fixed method set; the tag makes the difference between "clients browsing the catalog" and "clients actually executing tools" visible.Module move
AbstractMLMetricsCounterandMetricTypeare moved fromplugin/tocommon/(same package path). Reason:ml-algorithmshostsMcpSseTool/McpStreamableHttpTooland can't depend onplugin. ExistingMLAdoptionMetricsCounter/MLOperationalMetricsCounterremain inplugin/untouched.Bug fix included
Removes the
if (mlFeatureEnabledSetting.isMetricCollectionEnabled())gate around counter initialization inMachineLearningPlugin. That guard only skippedinitialize()— every runtime call site (e.g.MLPredictTaskRunner, and now the MCP call sites) callsgetInstance()unconditionally. On the default install whereplugins.ml_commons.metrics_collection_enabled=false(aFinalsetting), the counters were never initialized, sogetInstance()threwIllegalStateExceptioninside request handlers and brokePOST /_plugins/_ml/mcpentirely.Reproduced live: first
tools/listcall on a stock cluster returnedJSON_RPC_INTERNAL_ERRORwith"MLMcpServerMetricsCounter is not initialized". Fix: alwaysinitialize()the counters; emission is already gated insideincrementCounter/recordHistogramby the same feature flag, so no behavior change when metrics are off. Addedtest_doExecute_succeedsWhenMetricCollectionDisabledas a regression guard. This latent bug also existed for the pre-existingMLOperationalMetricsCounterin the predict path; the fix covers both.Verification
MLMcpMetricsCountersTestsincommon), tool invocation (McpSseToolTests,McpStreamableHttpToolTests), server request (TransportMcpServerActionTests), tool-call wrapping (McpToolsHelperTests), scan collectors (MLStatsJobProcessorTests)../gradlew :opensearch-ml-plugin:run): registered a tool, firedtools/listandtools/call, observed correct emission with expected tags in both metrics-enabled and metrics-disabled configurations.Related Issues
N/A
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.