feat: Add MCP connector and MCP server metrics by rithinpullela · Pull Request #4803 · opensearch-project/ml-commons

rithinpullela · 2026-04-28T19:03:27Z

Description

Adds OTel-backed metrics for MCP connectors (client-side — used by agents to call external MCP servers) and the ml-commons MCP server (server-side — exposes ml-commons tools over MCP). Follows the metrics pattern established in #4221 (MLAdoptionMetricsCounter / MLOperationalMetricsCounter).

New metrics

Adoption (periodic scan in MLStatsJobProcessor):

Metric	Tags
`ml.commons.MCP_CONNECTOR_COUNT`	`protocol` (`mcp_sse` / `mcp_streamable_http`)
`ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNT`	`tool_type`

Operational (emitted at call sites):

Metric	Tags
`ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT` / `_LATENCY`	`protocol`, `status`
`ml.commons.MCP_SERVER_TOOL_CALL_COUNT` / `_LATENCY`	`tool_type`, `status`
`ml.commons.MCP_SERVER_REQUEST_COUNT`	`method`, `status`

cluster_name is auto-injected by AbstractMLMetricsCounter, so "# domains using MCP connector/server" is derived upstream from COUNT(DISTINCT cluster_name) — same pattern as the existing MODEL_COUNT/AGENT_COUNT metrics.

The method tag on MCP_SERVER_REQUEST_COUNT captures the JSON-RPC method (initialize, tools/list, tools/call, notifications/initialized, or unknown on parse failure). Cardinality is bounded by the MCP spec's fixed method set; the tag makes the difference between "clients browsing the catalog" and "clients actually executing tools" visible.

Module move

AbstractMLMetricsCounter and MetricType are moved from plugin/ to common/ (same package path). Reason: ml-algorithms hosts McpSseTool / McpStreamableHttpTool and can't depend on plugin. Existing MLAdoptionMetricsCounter / MLOperationalMetricsCounter remain in plugin/ untouched.

Bug fix included

Removes the if (mlFeatureEnabledSetting.isMetricCollectionEnabled()) gate around counter initialization in MachineLearningPlugin. That guard only skipped initialize() — every runtime call site (e.g. MLPredictTaskRunner, and now the MCP call sites) calls getInstance() unconditionally. On the default install where plugins.ml_commons.metrics_collection_enabled=false (a Final setting), the counters were never initialized, so getInstance() threw IllegalStateException inside request handlers and broke POST /_plugins/_ml/mcp entirely.

Reproduced live: first tools/list call on a stock cluster returned JSON_RPC_INTERNAL_ERROR with "MLMcpServerMetricsCounter is not initialized". Fix: always initialize() the counters; emission is already gated inside incrementCounter / recordHistogram by the same feature flag, so no behavior change when metrics are off. Added test_doExecute_succeedsWhenMetricCollectionDisabled as a regression guard. This latent bug also existed for the pre-existing MLOperationalMetricsCounter in the predict path; the fix covers both.

Verification

Unit tests added/extended: counter scaffolding (MLMcpMetricsCountersTests in common), tool invocation (McpSseToolTests, McpStreamableHttpToolTests), server request (TransportMcpServerActionTests), tool-call wrapping (McpToolsHelperTests), scan collectors (MLStatsJobProcessorTests).
Manually verified on a local cluster (./gradlew :opensearch-ml-plugin:run): registered a tool, fired tools/list and tools/call, observed correct emission with expected tags in both metrics-enabled and metrics-disabled configurations.

Related Issues

N/A

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions · 2026-04-28T19:05:10Z

PR Reviewer Guide 🔍

(Review updated until commit `625b272`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Unbounded cardinality The `method` tag on `MCP_SERVER_REQUEST_COUNT` can still grow unbounded if a client sends arbitrary strings in the `method` field of a JSON-RPC notification. The `KNOWN_METHODS` set only covers requests, but line 113 extracts the method from notifications without checking against `KNOWN_METHODS`. A malicious or buggy client can inflate the time-series count by sending many distinct notification method names. private static String extractMethod(McpSchema.JSONRPCMessage message) { String method; if (message instanceof McpSchema.JSONRPCRequest) { method = ((McpSchema.JSONRPCRequest) message).method(); } else if (message instanceof McpSchema.JSONRPCNotification) { method = ((McpSchema.JSONRPCNotification) message).method(); } else { return "unknown"; } return KNOWN_METHODS.contains(method) ? method : "other"; } Possible NPE If a connector document is missing the `protocol` field, `hit.getSourceAsMap().get("protocol")` returns null, and calling `.toString()` on null throws a NullPointerException. This will fail the entire metrics collection run for MCP connectors. String protocol = hit.getSourceAsMap().get("protocol").toString();

github-actions · 2026-04-28T19:05:47Z

PR Code Suggestions ✨

Latest suggestions up to 625b272

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Handle null protocol field safely The code calls `.toString()` on `hit.getSourceAsMap().get("protocol")` without checking if the value is `null`, which will throw a `NullPointerException` if the `protocol` field is missing or null. Add a null check or provide a default value to handle missing fields safely. plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [278-294] client.search(searchRequest, new ActionListener<SearchResponse>() { @Override public void onResponse(SearchResponse searchResponse) { warnIfTruncated("MCP connector", searchResponse); for (SearchHit hit : searchResponse.getHits()) { - String protocol = hit.getSourceAsMap().get("protocol").toString(); + Object protocolObj = hit.getSourceAsMap().get("protocol"); + String protocol = protocolObj == null ? "unknown" : protocolObj.toString(); Tags tags = Tags.create().addTag("protocol", protocol); MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags); } } ... }); Suggestion importance[1-10]: 8 __ Why: This is a valid concern. The code calls `.toString()` on `hit.getSourceAsMap().get("protocol")` without null checking, which would throw a `NullPointerException` if the `protocol` field is missing or null. The suggested fix mirrors the pattern already used for `tool_type` in `collectMcpServerRegisteredToolMetrics()` at line 314, making it consistent and safe.	Medium
	Add null check for method The `extractMethod` function may return `null` if `method()` returns `null` on a valid request or notification, causing a `NullPointerException` in `KNOWN_METHODS.contains(method)`. Add a null check before calling `contains()` to prevent crashes. plugin/src/main/java/org/opensearch/ml/action/mcpserver/TransportMcpServerAction.java [143-153] private static String extractMethod(McpSchema.JSONRPCMessage message) { String method; if (message instanceof McpSchema.JSONRPCRequest) { method = ((McpSchema.JSONRPCRequest) message).method(); } else if (message instanceof McpSchema.JSONRPCNotification) { method = ((McpSchema.JSONRPCNotification) message).method(); } else { return "unknown"; } - return KNOWN_METHODS.contains(method) ? method : "other"; + return (method != null && KNOWN_METHODS.contains(method)) ? method : "other"; } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential `NullPointerException` if `method()` returns `null`. Adding a null check before `KNOWN_METHODS.contains(method)` prevents crashes and ensures the method returns `"other"` for null methods, which is appropriate defensive programming.	Medium
	Validate enum constants before streaming If `metricClass.getEnumConstants()` returns `null` (which happens when `metricClass` is not an enum), the `Stream.of()` call will throw a `NullPointerException`. Add a null check or validate that `metricClass` is an enum before processing to prevent crashes during initialization. common/src/main/java/org/opensearch/ml/stats/otel/counters/AbstractMLMetricsCounter.java [34-52] protected AbstractMLMetricsCounter( String clusterName, MetricsRegistry metricsRegistry, Class<T> metricClass, MLFeatureEnabledSetting mlFeatureEnabledSetting ) { this.clusterName = clusterName; this.metricsRegistry = metricsRegistry; this.metricCounterMap = new ConcurrentHashMap<>(); this.metricHistogramMap = new ConcurrentHashMap<>(); this.mlFeatureEnabledSetting = mlFeatureEnabledSetting; - Stream.of(metricClass.getEnumConstants()).forEach(metric -> { - if (getMetricType(metric) == MetricType.COUNTER) { - metricCounterMap.computeIfAbsent(metric, this::createMetricCounter); - } else if (getMetricType(metric) == MetricType.HISTOGRAM) { - metricHistogramMap.computeIfAbsent(metric, this::createMetricHistogram); - } - }); + T[] enumConstants = metricClass.getEnumConstants(); + if (enumConstants != null) { + Stream.of(enumConstants).forEach(metric -> { + if (getMetricType(metric) == MetricType.COUNTER) { + metricCounterMap.computeIfAbsent(metric, this::createMetricCounter); + } else if (getMetricType(metric) == MetricType.HISTOGRAM) { + metricHistogramMap.computeIfAbsent(metric, this::createMetricHistogram); + } + }); + } } Suggestion importance[1-10]: 6 __ Why: While the suggestion is technically correct that `getEnumConstants()` returns `null` for non-enum classes, this is a protected constructor in an abstract class with generic constraint `<T extends Enum<T>>`. In practice, callers pass enum classes (`McpConnectorMetric.class`, `McpServerMetric.class`), so the risk is low. However, adding the null check improves robustness and prevents potential misuse.	Low

Previous suggestions

Suggestions up to commit cc3d6dd

Category	Suggestion	Impact
General	Address incomplete metrics from truncation The `warnIfTruncated` method logs a warning when results are truncated, but metrics are still emitted for only the first batch. This can lead to misleading metrics that don't reflect the true count. Consider either implementing pagination to process all results or emitting a separate metric indicating truncation occurred. plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [327-337] private static void warnIfTruncated(String scanName, SearchResponse searchResponse) { if (searchResponse.getHits().getTotalHits() != null && searchResponse.getHits().getTotalHits().value() > BATCH_SIZE) { log .warn( "{} metrics scan truncated: total hits {} exceeds batch size {}; counts will be under-reported until pagination is added", scanName, searchResponse.getHits().getTotalHits().value(), BATCH_SIZE ); + // TODO: Implement pagination or emit a truncation metric } } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that truncated results lead to under-reported metrics. While the improved code only adds a TODO comment rather than implementing a solution, it highlights a significant issue that affects metric accuracy and should be addressed.	Medium
	Cache singleton instance locally The `recordInvocation` method calls `getInstance()` twice, which is inefficient. Cache the singleton instance in a local variable to reduce redundant method calls and improve performance, especially in high-throughput scenarios. ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-88] private void recordInvocation(long startNanos, String status) { double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status); - MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); - MLMcpConnectorMetricsCounter - .getInstance() - .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags); + MLMcpConnectorMetricsCounter counter = MLMcpConnectorMetricsCounter.getInstance(); + counter.incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); + counter.recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags); } Suggestion importance[1-10]: 5 __ Why: Caching `getInstance()` in a local variable reduces redundant method calls and improves code readability. While the performance gain is small (singleton access is typically fast), this is a valid optimization that enhances maintainability.	Low
	Capture end time immediately The `recordToolCall` method captures `startNanos` before the tool executes but records latency after the `ActionListener` callback fires, which may occur on a different thread. This can lead to inaccurate latency measurements if the callback is delayed. Consider capturing the end time immediately when the callback is invoked to ensure accurate latency tracking. plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-117] private static void recordToolCall(String toolType, String status, long startNanos) { - double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + long endNanos = System.nanoTime(); + double latencyMs = (endNanos - startNanos) / 1_000_000.0; Tags tags = Tags.create().addTag("tool_type", toolType).addTag("status", status); MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, tags); MLMcpServerMetricsCounter.getInstance().recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, tags); } Suggestion importance[1-10]: 3 __ Why: The suggestion to capture `endNanos` immediately is technically correct but has minimal impact since `recordToolCall` is invoked synchronously within the callback. The latency measurement already reflects the tool execution time accurately. The improvement is marginal.	Low

Suggestions up to commit 3fed456

Category	Suggestion	Impact
Possible issue	Wrap tool invocation in try-catch The `run` call is not wrapped in a try-catch block. If `StringUtils.getParameterMap` or `actualTool.run` throws a synchronous exception before invoking the listener, the Mono sink will never complete, causing the client to hang indefinitely. plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [107] -actualTool.run(StringUtils.getParameterMap(request.arguments()), actionListener); +try { + actualTool.run(StringUtils.getParameterMap(request.arguments()), actionListener); +} catch (Exception e) { + recordToolCall(toolType, "failure", startNanos); + log.error("Failed to execute tool, tool name: {}", toolName, e); + sink.error(e); +} Suggestion importance[1-10]: 8 __ Why: If `StringUtils.getParameterMap` or `actualTool.run` throws a synchronous exception before the listener is invoked, the Mono sink will never complete, causing the client to hang. This is a critical issue that can lead to deadlocks.	Medium
Possible issue	Handle null protocol field safely If `hit.getSourceAsMap().get("protocol")` returns `null`, calling `toString()` will throw a `NullPointerException`, causing the entire metrics collection job to fail and preventing subsequent connectors from being counted. plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [281-284] for (SearchHit hit : searchResponse.getHits()) { - String protocol = hit.getSourceAsMap().get("protocol").toString(); + Object protocolObj = hit.getSourceAsMap().get("protocol"); + String protocol = protocolObj == null ? "unknown" : protocolObj.toString(); Tags tags = Tags.create().addTag("protocol", protocol); MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags); } Suggestion importance[1-10]: 7 __ Why: The code calls `toString()` on `hit.getSourceAsMap().get("protocol")` without null-checking, which would throw `NullPointerException` if the field is missing. This would cause the entire metrics collection job to fail. The suggestion correctly mirrors the pattern already used for `toolType` at line 312.	Medium
General	Record notification processing failures Notifications are recorded as "success" even if the server fails to process them internally. Since notifications are fire-and-forget, any processing error goes unreported, skewing metrics and hiding failures. plugin/src/main/java/org/opensearch/ml/action/mcpserver/TransportMcpServerAction.java [112-116] if (message instanceof McpSchema.JSONRPCNotification) { - recordRequest(method, "success"); + try { + transportProvider.handleRequest(message).subscribe( + response -> recordRequest(method, "success"), + error -> recordRequest(method, "failure") + ); + } catch (Exception e) { + recordRequest(method, "failure"); + } listener.onResponse(new MLMcpServerResponse(true, null, null)); return; } Suggestion importance[1-10]: 4 __ Why: While the suggestion correctly identifies that notification processing errors are not tracked, the current implementation treats notifications as fire-and-forget per JSON-RPC spec. The suggested change would add complexity without clear benefit, as notifications don't expect responses.	Low

Suggestions up to commit 32ceee9

Category	Suggestion	Impact
General	Guard against negative latency values The `recordToolCall` method is invoked from both success and failure branches, but if `System.nanoTime()` wraps or the clock is adjusted, the computed latency could be negative or nonsensical. Add a guard to ensure `latencyMs` is non-negative before recording the histogram to prevent misleading metrics. plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-117] private static void recordToolCall(String toolType, String status, long startNanos) { double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + if (latencyMs < 0) { + latencyMs = 0; + } Tags tags = Tags.create().addTag("tool_type", toolType).addTag("status", status); MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, tags); MLMcpServerMetricsCounter.getInstance().recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, tags); } Suggestion importance[1-10]: 5 __ Why: While `System.nanoTime()` is monotonic and shouldn't wrap in practice, adding a guard against negative latency is a defensive programming practice that prevents potential metric corruption. However, this is a low-probability edge case with moderate impact.	Low
	Prevent negative latency in histogram If `System.nanoTime()` wraps or the system clock is adjusted, `latencyMs` could become negative. Recording a negative histogram value can corrupt metrics. Add a check to clamp `latencyMs` to zero if it's negative before recording. ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-88] private void recordInvocation(long startNanos, String status) { double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + if (latencyMs < 0) { + latencyMs = 0; + } Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status); MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); MLMcpConnectorMetricsCounter .getInstance() .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags); } Suggestion importance[1-10]: 5 __ Why: The suggestion adds defensive programming to prevent negative latency values from being recorded. While `System.nanoTime()` is monotonic and this scenario is unlikely, the guard provides safety against metric corruption at minimal cost.	Low
	Clamp negative latency to zero The computed `latencyMs` can be negative if `System.nanoTime()` wraps or the clock is adjusted. Recording a negative value in a histogram can corrupt metrics. Add a guard to ensure `latencyMs` is non-negative before recording. ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpStreamableHttpTool.java [82-89] private void recordInvocation(long startNanos, String status) { double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + if (latencyMs < 0) { + latencyMs = 0; + } Tags tags = Tags.create().addTag("protocol", MCP_STREAMABLE_HTTP).addTag("status", status); MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); MLMcpConnectorMetricsCounter .getInstance() .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags); } Suggestion importance[1-10]: 5 __ Why: Similar to the previous suggestions, this adds a defensive check against negative latency values. While unlikely with `System.nanoTime()`, it's a reasonable safeguard that prevents potential histogram corruption with minimal overhead.	Low

Suggestions up to commit 463d8ff

Category	Suggestion	Impact
Possible issue	Guard metrics recording against uninitialized singleton `MLMcpServerMetricsCounter.getInstance()` throws `IllegalStateException` if the counter has not been initialized. If `recordToolCall` is invoked before initialization (e.g., in tests or during early startup), it will crash the request handler. Cache the instance in a local variable and wrap the call in a try-catch or null-check to prevent an unhandled exception from propagating to the reactive sink. plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-119] private static void recordToolCall(String toolType, String status, long startNanos) { - double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; - Tags countTags = Tags.create().addTag("tool_type", toolType).addTag("status", status); - MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, countTags); - MLMcpServerMetricsCounter - .getInstance() - .recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, Tags.create().addTag("tool_type", toolType)); + try { + double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + MLMcpServerMetricsCounter counter = MLMcpServerMetricsCounter.getInstance(); + Tags countTags = Tags.create().addTag("tool_type", toolType).addTag("status", status); + counter.incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, countTags); + counter.recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, Tags.create().addTag("tool_type", toolType)); + } catch (IllegalStateException e) { + log.warn("MCP server metrics counter not initialized, skipping metric recording", e); + } } Suggestion importance[1-10]: 5 __ Why: The `getInstance()` call can throw `IllegalStateException` if not initialized, but the PR already initializes all counters unconditionally in `MachineLearningPlugin.createComponents()`, making this a low-probability edge case. The suggestion is valid but has limited practical impact given the initialization pattern used.	Low
General	Warn when paginated metrics collection is incomplete `collectMcpConnectorMetrics` only fetches the first `BATCH_SIZE` documents but emits one counter increment per hit, so if there are more connectors than `BATCH_SIZE` the remaining ones are silently ignored. The method should either use scroll/search-after pagination or at minimum log a warning when the total hits exceed the fetched page size, to avoid under-counting. plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [276-292] client.search(searchRequest, new ActionListener<SearchResponse>() { @Override public void onResponse(SearchResponse searchResponse) { + long totalHits = searchResponse.getHits().getTotalHits().value; + if (totalHits > BATCH_SIZE) { + log.warn("MCP connector count ({}) exceeds batch size ({}); metrics may be under-counted", totalHits, BATCH_SIZE); + } for (SearchHit hit : searchResponse.getHits()) { Object protocolObj = hit.getSourceAsMap().get("protocol"); String protocol = protocolObj == null ? "unknown" : protocolObj.toString(); Tags tags = Tags.create().addTag("protocol", protocol); MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags); } } Suggestion importance[1-10]: 5 __ Why: The truncation issue is real — `collectMcpConnectorMetrics` silently under-counts when connectors exceed `BATCH_SIZE`. Adding a warning log is a reasonable minimal fix, though full pagination would be more correct. The `improved_code` accurately reflects the suggested change.	Low
General	Avoid duplicate getInstance calls and guard uninitialized singleton `MLMcpConnectorMetricsCounter.getInstance()` is called twice, which doubles the synchronization overhead and risks an `IllegalStateException` if the singleton is not initialized. Store the instance in a local variable and wrap the block in a try-catch so a missing initialization does not propagate as an unhandled exception to the caller. ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-92] private void recordInvocation(long startNanos, String status) { - double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; - Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status); - MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); - MLMcpConnectorMetricsCounter - .getInstance() - .recordHistogram( + try { + double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0; + MLMcpConnectorMetricsCounter counter = MLMcpConnectorMetricsCounter.getInstance(); + Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status); + counter.incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags); + counter.recordHistogram( McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, Tags.create().addTag("protocol", MCP_SSE) ); + } catch (IllegalStateException e) { + log.warn("MCP connector metrics counter not initialized, skipping metric recording", e); + } } Suggestion importance[1-10]: 4 __ Why: Caching the instance in a local variable is a minor improvement to avoid double synchronization, but the `getInstance()` method uses `synchronized` which has low overhead. The uninitialized guard is valid but unlikely given the plugin's initialization pattern.	Low

codecov · 2026-04-28T20:04:44Z

Codecov Report

❌ Patch coverage is 91.72932% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.45%. Comparing base (12f884e) to head (463d8ff).

Files with missing lines	Patch %	Lines
...search/ml/jobs/processors/MLStatsJobProcessor.java	84.61%	4 Missing and 2 partials ⚠️
...opensearch/ml/action/mcpserver/McpToolsHelper.java	81.25%	3 Missing ⚠️
.../ml/action/mcpserver/TransportMcpServerAction.java	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #4803      +/-   ##
============================================
+ Coverage     77.42%   77.45%   +0.03%     
- Complexity    11907    11933      +26     
============================================
  Files           963      967       +4     
  Lines         53326    53450     +124     
  Branches       6503     6507       +4     
============================================
+ Hits          41285    41401     +116     
- Misses         9289     9298       +9     
+ Partials       2752     2751       -1

Flag	Coverage Δ
ml-commons	`77.45% <91.72%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-05-26T20:27:46Z

Persistent review updated to latest commit 32ceee9

Adds OTel-backed metrics for MCP connectors (client-side) and the ml-commons MCP server (server-side), following the pattern established in opensearch-project#4221. New metrics: - ml.commons.MCP_CONNECTOR_COUNT (adoption, tag: protocol) - ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT/_LATENCY (operational, tags: protocol, status) - ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNT (adoption, tag: tool_type) - ml.commons.MCP_SERVER_TOOL_CALL_COUNT/_LATENCY (operational, tags: tool_type, status) - ml.commons.MCP_SERVER_REQUEST_COUNT (operational, tags: method, status) Adoption metrics are emitted by MLStatsJobProcessor via periodic index scans; operational metrics are emitted at call sites. "Domains using MCP" is derived upstream from the auto-injected cluster_name tag. Moves AbstractMLMetricsCounter and MetricType from plugin to common so ml-algorithms can reference the counter singletons. Also removes the isMetricCollectionEnabled() gate around counter initialization in MachineLearningPlugin. Counter emission is already gated inside the abstract base, so initializing eagerly makes getInstance() safe at all call sites and fixes a latent IllegalStateException when metric collection is disabled. Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

- Add testMetricsEmittedOnSuccessAndFailure to McpStreamableHttpToolTests (mirrors the SSE test; both tools have independent recordInvocation paths) - Expand MachineLearningPlugin init comment with the full repro of the IllegalStateException bug that motivated eager counter initialization - Add MLMcpMetricsCountersTests in common/src/test covering the MCP counter singleton scaffolding (JaCoCo reports per-module; call-site tests in other modules don't cover the new counter classes in common) - Drop redundant tearDown reset in MLStatsJobProcessorTests (setUp reset is sufficient and matches the pattern used by MLAdoptionMetricsCounterTests) Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

- Histogram tags: pass the same protocol+status / tool_type+status Tags object to recordHistogram so latency can be sliced by success vs failure (McpSseTool, McpStreamableHttpTool, McpToolsHelper). - TransportMcpServerAction.extractMethod: collapse anything outside the server-served allowlist (initialize, notifications/initialized, ping, tools/list, tools/call, logging/setLevel) to "other" so untrusted client input cannot inflate the method-tag cardinality. - MLStatsJobProcessor: log a warning when a metrics scan returns more hits than BATCH_SIZE so operators notice silent under-counting until pagination is added. Drop the unreachable "unknown" protocol fallback in the connector scan (search query already filters on protocol). - MachineLearningPlugin: hoist clusterName local across the four counter initialize() calls. - Tests: assert the histogram receives the status tag in the SSE/StreamableHttp tool tests; add a TransportMcpServerAction test proving an off-allowlist method is tagged as "other". Signed-off-by: rithinpullela <rithinp@amazon.com>

github-actions · 2026-05-27T14:06:49Z

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 3fed456.

Path	Line	Severity	Description
common/src/main/java/org/opensearch/ml/stats/otel/counters/MLMcpConnectorMetricsCounter.java	47	low	Public static reset() methods on production singleton classes (MLMcpConnectorMetricsCounter and MLMcpServerMetricsCounter) are documented as test-only but are accessible from any caller in the same JVM. If invoked in a running cluster, they null out the singleton and cause subsequent getInstance() calls to throw IllegalStateException, disrupting metrics collection. The pattern is consistent with existing counters in the codebase and poses no direct data exfiltration risk, but the lack of access restriction is an anomaly worth noting.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1

Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.

⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

github-actions · 2026-05-27T14:07:53Z

Persistent review updated to latest commit 3fed456

Signed-off-by: rithinpullela <rithinp@amazon.com>

github-actions · 2026-05-27T14:10:52Z

Persistent review updated to latest commit cc3d6dd

Extract recordInvocation duplication between McpSseTool and McpStreamableHttpTool into a package-private McpToolMetrics helper. Strengthen MLMcpMetricsCountersTests to mirror the MLAdoptionMetricsCounterTests / MLOperationalMetricsCounterTests pattern (registry creation counts, increment/record routing, feature-flag toggle). Signed-off-by: rithinpullela <rithinp@amazon.com>

github-actions · 2026-05-27T14:55:28Z

Persistent review updated to latest commit 625b272

rithinpullela requested review from HenryL27, Zhangxunmt, akolarkunnu, austintlee, b4sjoo, dhrubo-os, jngz-es, mingshl, model-collapse, pyek-bot, rbhavna, sam-herman, xinyual, ylwu-amzn and zane-neo as code owners April 28, 2026 19:03

rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Inactive

rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Error

rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Inactive

rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Failure

rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 20:05 — with GitHub Actions Inactive

nathaliellenaa reviewed May 21, 2026

View reviewed changes

Comment thread ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java Outdated

Comment thread plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java

ylwu-amzn reviewed May 26, 2026

View reviewed changes

rithinpullela added 3 commits May 27, 2026 07:05

Trim init comment in MachineLearningPlugin

f0f8ab0

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

rithinpullela force-pushed the mcp-metrics branch from 32ceee9 to 3fed456 Compare May 27, 2026 14:05

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:08 — with GitHub Actions Failure

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:08 — with GitHub Actions Error

Extend warnIfTruncated to model and agent metric scans

cc3d6dd

Signed-off-by: rithinpullela <rithinp@amazon.com>

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:10 — with GitHub Actions Failure

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:10 — with GitHub Actions Error

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:55 — with GitHub Actions Failure

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:55 — with GitHub Actions Error

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:59 — with GitHub Actions Failure

rithinpullela had a problem deploying to ml-commons-cicd-env May 27, 2026 14:59 — with GitHub Actions Error

Conversation

rithinpullela commented Apr 28, 2026

Description

New metrics

Module move

Bug fix included

Verification

Related Issues

Check List

Uh oh!

github-actions Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 625b272)

Uh oh!

github-actions Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

codecov Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

github-actions Bot commented May 27, 2026

PR Code Analyzer ❗

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Apr 28, 2026 •

edited

Loading

(Review updated until commit `625b272`)

github-actions Bot commented Apr 28, 2026 •

edited

Loading

codecov Bot commented Apr 28, 2026 •

edited

Loading