Skip to content

feat: Add MCP connector and MCP server metrics#4803

Open
rithinpullela wants to merge 6 commits into
opensearch-project:mainfrom
rithinpullela:mcp-metrics
Open

feat: Add MCP connector and MCP server metrics#4803
rithinpullela wants to merge 6 commits into
opensearch-project:mainfrom
rithinpullela:mcp-metrics

Conversation

@rithinpullela
Copy link
Copy Markdown
Collaborator

Description

Adds OTel-backed metrics for MCP connectors (client-side — used by agents to call external MCP servers) and the ml-commons MCP server (server-side — exposes ml-commons tools over MCP). Follows the metrics pattern established in #4221 (MLAdoptionMetricsCounter / MLOperationalMetricsCounter).

New metrics

Adoption (periodic scan in MLStatsJobProcessor):

Metric Tags
ml.commons.MCP_CONNECTOR_COUNT protocol (mcp_sse / mcp_streamable_http)
ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNT tool_type

Operational (emitted at call sites):

Metric Tags
ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT / _LATENCY protocol, status
ml.commons.MCP_SERVER_TOOL_CALL_COUNT / _LATENCY tool_type, status
ml.commons.MCP_SERVER_REQUEST_COUNT method, status

cluster_name is auto-injected by AbstractMLMetricsCounter, so "# domains using MCP connector/server" is derived upstream from COUNT(DISTINCT cluster_name) — same pattern as the existing MODEL_COUNT/AGENT_COUNT metrics.

The method tag on MCP_SERVER_REQUEST_COUNT captures the JSON-RPC method (initialize, tools/list, tools/call, notifications/initialized, or unknown on parse failure). Cardinality is bounded by the MCP spec's fixed method set; the tag makes the difference between "clients browsing the catalog" and "clients actually executing tools" visible.

Module move

AbstractMLMetricsCounter and MetricType are moved from plugin/ to common/ (same package path). Reason: ml-algorithms hosts McpSseTool / McpStreamableHttpTool and can't depend on plugin. Existing MLAdoptionMetricsCounter / MLOperationalMetricsCounter remain in plugin/ untouched.

Bug fix included

Removes the if (mlFeatureEnabledSetting.isMetricCollectionEnabled()) gate around counter initialization in MachineLearningPlugin. That guard only skipped initialize() — every runtime call site (e.g. MLPredictTaskRunner, and now the MCP call sites) calls getInstance() unconditionally. On the default install where plugins.ml_commons.metrics_collection_enabled=false (a Final setting), the counters were never initialized, so getInstance() threw IllegalStateException inside request handlers and broke POST /_plugins/_ml/mcp entirely.

Reproduced live: first tools/list call on a stock cluster returned JSON_RPC_INTERNAL_ERROR with "MLMcpServerMetricsCounter is not initialized". Fix: always initialize() the counters; emission is already gated inside incrementCounter / recordHistogram by the same feature flag, so no behavior change when metrics are off. Added test_doExecute_succeedsWhenMetricCollectionDisabled as a regression guard. This latent bug also existed for the pre-existing MLOperationalMetricsCounter in the predict path; the fix covers both.

Verification

  • Unit tests added/extended: counter scaffolding (MLMcpMetricsCountersTests in common), tool invocation (McpSseToolTests, McpStreamableHttpToolTests), server request (TransportMcpServerActionTests), tool-call wrapping (McpToolsHelperTests), scan collectors (MLStatsJobProcessorTests).
  • Manually verified on a local cluster (./gradlew :opensearch-ml-plugin:run): registered a tool, fired tools/list and tools/call, observed correct emission with expected tags in both metrics-enabled and metrics-disabled configurations.

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

PR Reviewer Guide 🔍

(Review updated until commit 625b272)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Unbounded cardinality

The method tag on MCP_SERVER_REQUEST_COUNT can still grow unbounded if a client sends arbitrary strings in the method field of a JSON-RPC notification. The KNOWN_METHODS set only covers requests, but line 113 extracts the method from notifications without checking against KNOWN_METHODS. A malicious or buggy client can inflate the time-series count by sending many distinct notification method names.

private static String extractMethod(McpSchema.JSONRPCMessage message) {
    String method;
    if (message instanceof McpSchema.JSONRPCRequest) {
        method = ((McpSchema.JSONRPCRequest) message).method();
    } else if (message instanceof McpSchema.JSONRPCNotification) {
        method = ((McpSchema.JSONRPCNotification) message).method();
    } else {
        return "unknown";
    }
    return KNOWN_METHODS.contains(method) ? method : "other";
}
Possible NPE

If a connector document is missing the protocol field, hit.getSourceAsMap().get("protocol") returns null, and calling .toString() on null throws a NullPointerException. This will fail the entire metrics collection run for MCP connectors.

String protocol = hit.getSourceAsMap().get("protocol").toString();

@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Inactive
@rithinpullela rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Error
@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Inactive
@rithinpullela rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval April 28, 2026 19:05 — with GitHub Actions Failure
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

PR Code Suggestions ✨

Latest suggestions up to 625b272

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Handle null protocol field safely

The code calls .toString() on hit.getSourceAsMap().get("protocol") without checking
if the value is null, which will throw a NullPointerException if the protocol field
is missing or null. Add a null check or provide a default value to handle missing
fields safely.

plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [278-294]

 client.search(searchRequest, new ActionListener<SearchResponse>() {
     @Override
     public void onResponse(SearchResponse searchResponse) {
         warnIfTruncated("MCP connector", searchResponse);
         for (SearchHit hit : searchResponse.getHits()) {
-            String protocol = hit.getSourceAsMap().get("protocol").toString();
+            Object protocolObj = hit.getSourceAsMap().get("protocol");
+            String protocol = protocolObj == null ? "unknown" : protocolObj.toString();
             Tags tags = Tags.create().addTag("protocol", protocol);
             MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags);
         }
     }
     ...
 });
Suggestion importance[1-10]: 8

__

Why: This is a valid concern. The code calls .toString() on hit.getSourceAsMap().get("protocol") without null checking, which would throw a NullPointerException if the protocol field is missing or null. The suggested fix mirrors the pattern already used for tool_type in collectMcpServerRegisteredToolMetrics() at line 314, making it consistent and safe.

Medium
Add null check for method

The extractMethod function may return null if method() returns null on a valid
request or notification, causing a NullPointerException in
KNOWN_METHODS.contains(method). Add a null check before calling contains() to
prevent crashes.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/TransportMcpServerAction.java [143-153]

 private static String extractMethod(McpSchema.JSONRPCMessage message) {
     String method;
     if (message instanceof McpSchema.JSONRPCRequest) {
         method = ((McpSchema.JSONRPCRequest) message).method();
     } else if (message instanceof McpSchema.JSONRPCNotification) {
         method = ((McpSchema.JSONRPCNotification) message).method();
     } else {
         return "unknown";
     }
-    return KNOWN_METHODS.contains(method) ? method : "other";
+    return (method != null && KNOWN_METHODS.contains(method)) ? method : "other";
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential NullPointerException if method() returns null. Adding a null check before KNOWN_METHODS.contains(method) prevents crashes and ensures the method returns "other" for null methods, which is appropriate defensive programming.

Medium
Validate enum constants before streaming

If metricClass.getEnumConstants() returns null (which happens when metricClass is
not an enum), the Stream.of() call will throw a NullPointerException. Add a null
check or validate that metricClass is an enum before processing to prevent crashes
during initialization.

common/src/main/java/org/opensearch/ml/stats/otel/counters/AbstractMLMetricsCounter.java [34-52]

 protected AbstractMLMetricsCounter(
     String clusterName,
     MetricsRegistry metricsRegistry,
     Class<T> metricClass,
     MLFeatureEnabledSetting mlFeatureEnabledSetting
 ) {
     this.clusterName = clusterName;
     this.metricsRegistry = metricsRegistry;
     this.metricCounterMap = new ConcurrentHashMap<>();
     this.metricHistogramMap = new ConcurrentHashMap<>();
     this.mlFeatureEnabledSetting = mlFeatureEnabledSetting;
-    Stream.of(metricClass.getEnumConstants()).forEach(metric -> {
-        if (getMetricType(metric) == MetricType.COUNTER) {
-            metricCounterMap.computeIfAbsent(metric, this::createMetricCounter);
-        } else if (getMetricType(metric) == MetricType.HISTOGRAM) {
-            metricHistogramMap.computeIfAbsent(metric, this::createMetricHistogram);
-        }
-    });
+    T[] enumConstants = metricClass.getEnumConstants();
+    if (enumConstants != null) {
+        Stream.of(enumConstants).forEach(metric -> {
+            if (getMetricType(metric) == MetricType.COUNTER) {
+                metricCounterMap.computeIfAbsent(metric, this::createMetricCounter);
+            } else if (getMetricType(metric) == MetricType.HISTOGRAM) {
+                metricHistogramMap.computeIfAbsent(metric, this::createMetricHistogram);
+            }
+        });
+    }
 }
Suggestion importance[1-10]: 6

__

Why: While the suggestion is technically correct that getEnumConstants() returns null for non-enum classes, this is a protected constructor in an abstract class with generic constraint <T extends Enum<T>>. In practice, callers pass enum classes (McpConnectorMetric.class, McpServerMetric.class), so the risk is low. However, adding the null check improves robustness and prevents potential misuse.

Low

Previous suggestions

Suggestions up to commit cc3d6dd
CategorySuggestion                                                                                                                                    Impact
General
Address incomplete metrics from truncation

The warnIfTruncated method logs a warning when results are truncated, but metrics
are still emitted for only the first batch. This can lead to misleading metrics that
don't reflect the true count. Consider either implementing pagination to process all
results or emitting a separate metric indicating truncation occurred.

plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [327-337]

 private static void warnIfTruncated(String scanName, SearchResponse searchResponse) {
     if (searchResponse.getHits().getTotalHits() != null && searchResponse.getHits().getTotalHits().value() > BATCH_SIZE) {
         log
             .warn(
                 "{} metrics scan truncated: total hits {} exceeds batch size {}; counts will be under-reported until pagination is added",
                 scanName,
                 searchResponse.getHits().getTotalHits().value(),
                 BATCH_SIZE
             );
+        // TODO: Implement pagination or emit a truncation metric
     }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that truncated results lead to under-reported metrics. While the improved code only adds a TODO comment rather than implementing a solution, it highlights a significant issue that affects metric accuracy and should be addressed.

Medium
Cache singleton instance locally

The recordInvocation method calls getInstance() twice, which is inefficient. Cache
the singleton instance in a local variable to reduce redundant method calls and
improve performance, especially in high-throughput scenarios.

ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-88]

 private void recordInvocation(long startNanos, String status) {
     double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
     Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status);
-    MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
-    MLMcpConnectorMetricsCounter
-        .getInstance()
-        .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags);
+    MLMcpConnectorMetricsCounter counter = MLMcpConnectorMetricsCounter.getInstance();
+    counter.incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
+    counter.recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags);
 }
Suggestion importance[1-10]: 5

__

Why: Caching getInstance() in a local variable reduces redundant method calls and improves code readability. While the performance gain is small (singleton access is typically fast), this is a valid optimization that enhances maintainability.

Low
Capture end time immediately

The recordToolCall method captures startNanos before the tool executes but records
latency after the ActionListener callback fires, which may occur on a different
thread. This can lead to inaccurate latency measurements if the callback is delayed.
Consider capturing the end time immediately when the callback is invoked to ensure
accurate latency tracking.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-117]

 private static void recordToolCall(String toolType, String status, long startNanos) {
-    double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+    long endNanos = System.nanoTime();
+    double latencyMs = (endNanos - startNanos) / 1_000_000.0;
     Tags tags = Tags.create().addTag("tool_type", toolType).addTag("status", status);
     MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, tags);
     MLMcpServerMetricsCounter.getInstance().recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, tags);
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion to capture endNanos immediately is technically correct but has minimal impact since recordToolCall is invoked synchronously within the callback. The latency measurement already reflects the tool execution time accurately. The improvement is marginal.

Low
Suggestions up to commit 3fed456
CategorySuggestion                                                                                                                                    Impact
Possible issue
Wrap tool invocation in try-catch

The run call is not wrapped in a try-catch block. If StringUtils.getParameterMap or
actualTool.run throws a synchronous exception before invoking the listener, the Mono
sink will never complete, causing the client to hang indefinitely.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [107]

-actualTool.run(StringUtils.getParameterMap(request.arguments()), actionListener);
+try {
+    actualTool.run(StringUtils.getParameterMap(request.arguments()), actionListener);
+} catch (Exception e) {
+    recordToolCall(toolType, "failure", startNanos);
+    log.error("Failed to execute tool, tool name: {}", toolName, e);
+    sink.error(e);
+}
Suggestion importance[1-10]: 8

__

Why: If StringUtils.getParameterMap or actualTool.run throws a synchronous exception before the listener is invoked, the Mono sink will never complete, causing the client to hang. This is a critical issue that can lead to deadlocks.

Medium
Handle null protocol field safely

If hit.getSourceAsMap().get("protocol") returns null, calling toString() will throw
a NullPointerException, causing the entire metrics collection job to fail and
preventing subsequent connectors from being counted.

plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [281-284]

 for (SearchHit hit : searchResponse.getHits()) {
-    String protocol = hit.getSourceAsMap().get("protocol").toString();
+    Object protocolObj = hit.getSourceAsMap().get("protocol");
+    String protocol = protocolObj == null ? "unknown" : protocolObj.toString();
     Tags tags = Tags.create().addTag("protocol", protocol);
     MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags);
 }
Suggestion importance[1-10]: 7

__

Why: The code calls toString() on hit.getSourceAsMap().get("protocol") without null-checking, which would throw NullPointerException if the field is missing. This would cause the entire metrics collection job to fail. The suggestion correctly mirrors the pattern already used for toolType at line 312.

Medium
General
Record notification processing failures

Notifications are recorded as "success" even if the server fails to process them
internally. Since notifications are fire-and-forget, any processing error goes
unreported, skewing metrics and hiding failures.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/TransportMcpServerAction.java [112-116]

 if (message instanceof McpSchema.JSONRPCNotification) {
-    recordRequest(method, "success");
+    try {
+        transportProvider.handleRequest(message).subscribe(
+            response -> recordRequest(method, "success"),
+            error -> recordRequest(method, "failure")
+        );
+    } catch (Exception e) {
+        recordRequest(method, "failure");
+    }
     listener.onResponse(new MLMcpServerResponse(true, null, null));
     return;
 }
Suggestion importance[1-10]: 4

__

Why: While the suggestion correctly identifies that notification processing errors are not tracked, the current implementation treats notifications as fire-and-forget per JSON-RPC spec. The suggested change would add complexity without clear benefit, as notifications don't expect responses.

Low
Suggestions up to commit 32ceee9
CategorySuggestion                                                                                                                                    Impact
General
Guard against negative latency values

The recordToolCall method is invoked from both success and failure branches, but if
System.nanoTime() wraps or the clock is adjusted, the computed latency could be
negative or nonsensical. Add a guard to ensure latencyMs is non-negative before
recording the histogram to prevent misleading metrics.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-117]

 private static void recordToolCall(String toolType, String status, long startNanos) {
     double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+    if (latencyMs < 0) {
+        latencyMs = 0;
+    }
     Tags tags = Tags.create().addTag("tool_type", toolType).addTag("status", status);
     MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, tags);
     MLMcpServerMetricsCounter.getInstance().recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, tags);
 }
Suggestion importance[1-10]: 5

__

Why: While System.nanoTime() is monotonic and shouldn't wrap in practice, adding a guard against negative latency is a defensive programming practice that prevents potential metric corruption. However, this is a low-probability edge case with moderate impact.

Low
Prevent negative latency in histogram

If System.nanoTime() wraps or the system clock is adjusted, latencyMs could become
negative. Recording a negative histogram value can corrupt metrics. Add a check to
clamp latencyMs to zero if it's negative before recording.

ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-88]

 private void recordInvocation(long startNanos, String status) {
     double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+    if (latencyMs < 0) {
+        latencyMs = 0;
+    }
     Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status);
     MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
     MLMcpConnectorMetricsCounter
         .getInstance()
         .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion adds defensive programming to prevent negative latency values from being recorded. While System.nanoTime() is monotonic and this scenario is unlikely, the guard provides safety against metric corruption at minimal cost.

Low
Clamp negative latency to zero

The computed latencyMs can be negative if System.nanoTime() wraps or the clock is
adjusted. Recording a negative value in a histogram can corrupt metrics. Add a guard
to ensure latencyMs is non-negative before recording.

ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpStreamableHttpTool.java [82-89]

 private void recordInvocation(long startNanos, String status) {
     double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+    if (latencyMs < 0) {
+        latencyMs = 0;
+    }
     Tags tags = Tags.create().addTag("protocol", MCP_STREAMABLE_HTTP).addTag("status", status);
     MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
     MLMcpConnectorMetricsCounter
         .getInstance()
         .recordHistogram(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY, latencyMs, tags);
 }
Suggestion importance[1-10]: 5

__

Why: Similar to the previous suggestions, this adds a defensive check against negative latency values. While unlikely with System.nanoTime(), it's a reasonable safeguard that prevents potential histogram corruption with minimal overhead.

Low
Suggestions up to commit 463d8ff
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard metrics recording against uninitialized singleton

MLMcpServerMetricsCounter.getInstance() throws IllegalStateException if the counter
has not been initialized. If recordToolCall is invoked before initialization (e.g.,
in tests or during early startup), it will crash the request handler. Cache the
instance in a local variable and wrap the call in a try-catch or null-check to
prevent an unhandled exception from propagating to the reactive sink.

plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java [112-119]

 private static void recordToolCall(String toolType, String status, long startNanos) {
-    double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
-    Tags countTags = Tags.create().addTag("tool_type", toolType).addTag("status", status);
-    MLMcpServerMetricsCounter.getInstance().incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, countTags);
-    MLMcpServerMetricsCounter
-        .getInstance()
-        .recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, Tags.create().addTag("tool_type", toolType));
+    try {
+        double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+        MLMcpServerMetricsCounter counter = MLMcpServerMetricsCounter.getInstance();
+        Tags countTags = Tags.create().addTag("tool_type", toolType).addTag("status", status);
+        counter.incrementCounter(McpServerMetric.MCP_SERVER_TOOL_CALL_COUNT, countTags);
+        counter.recordHistogram(McpServerMetric.MCP_SERVER_TOOL_CALL_LATENCY, latencyMs, Tags.create().addTag("tool_type", toolType));
+    } catch (IllegalStateException e) {
+        log.warn("MCP server metrics counter not initialized, skipping metric recording", e);
+    }
 }
Suggestion importance[1-10]: 5

__

Why: The getInstance() call can throw IllegalStateException if not initialized, but the PR already initializes all counters unconditionally in MachineLearningPlugin.createComponents(), making this a low-probability edge case. The suggestion is valid but has limited practical impact given the initialization pattern used.

Low
General
Warn when paginated metrics collection is incomplete

collectMcpConnectorMetrics only fetches the first BATCH_SIZE documents but emits one
counter increment per hit, so if there are more connectors than BATCH_SIZE the
remaining ones are silently ignored. The method should either use
scroll/search-after pagination or at minimum log a warning when the total hits
exceed the fetched page size, to avoid under-counting.

plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java [276-292]

 client.search(searchRequest, new ActionListener<SearchResponse>() {
     @Override
     public void onResponse(SearchResponse searchResponse) {
+        long totalHits = searchResponse.getHits().getTotalHits().value;
+        if (totalHits > BATCH_SIZE) {
+            log.warn("MCP connector count ({}) exceeds batch size ({}); metrics may be under-counted", totalHits, BATCH_SIZE);
+        }
         for (SearchHit hit : searchResponse.getHits()) {
             Object protocolObj = hit.getSourceAsMap().get("protocol");
             String protocol = protocolObj == null ? "unknown" : protocolObj.toString();
             Tags tags = Tags.create().addTag("protocol", protocol);
             MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_COUNT, tags);
         }
     }
Suggestion importance[1-10]: 5

__

Why: The truncation issue is real — collectMcpConnectorMetrics silently under-counts when connectors exceed BATCH_SIZE. Adding a warning log is a reasonable minimal fix, though full pagination would be more correct. The improved_code accurately reflects the suggested change.

Low
Avoid duplicate getInstance calls and guard uninitialized singleton

MLMcpConnectorMetricsCounter.getInstance() is called twice, which doubles the
synchronization overhead and risks an IllegalStateException if the singleton is not
initialized. Store the instance in a local variable and wrap the block in a
try-catch so a missing initialization does not propagate as an unhandled exception
to the caller.

ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java [81-92]

 private void recordInvocation(long startNanos, String status) {
-    double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
-    Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status);
-    MLMcpConnectorMetricsCounter.getInstance().incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
-    MLMcpConnectorMetricsCounter
-        .getInstance()
-        .recordHistogram(
+    try {
+        double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
+        MLMcpConnectorMetricsCounter counter = MLMcpConnectorMetricsCounter.getInstance();
+        Tags tags = Tags.create().addTag("protocol", MCP_SSE).addTag("status", status);
+        counter.incrementCounter(McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_COUNT, tags);
+        counter.recordHistogram(
             McpConnectorMetric.MCP_CONNECTOR_TOOL_INVOCATION_LATENCY,
             latencyMs,
             Tags.create().addTag("protocol", MCP_SSE)
         );
+    } catch (IllegalStateException e) {
+        log.warn("MCP connector metrics counter not initialized, skipping metric recording", e);
+    }
 }
Suggestion importance[1-10]: 4

__

Why: Caching the instance in a local variable is a minor improvement to avoid double synchronization, but the getInstance() method uses synchronized which has low overhead. The uninitialized guard is valid but unlikely given the plugin's initialization pattern.

Low

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 91.72932% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.45%. Comparing base (12f884e) to head (463d8ff).

Files with missing lines Patch % Lines
...search/ml/jobs/processors/MLStatsJobProcessor.java 84.61% 4 Missing and 2 partials ⚠️
...opensearch/ml/action/mcpserver/McpToolsHelper.java 81.25% 3 Missing ⚠️
.../ml/action/mcpserver/TransportMcpServerAction.java 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4803      +/-   ##
============================================
+ Coverage     77.42%   77.45%   +0.03%     
- Complexity    11907    11933      +26     
============================================
  Files           963      967       +4     
  Lines         53326    53450     +124     
  Branches       6503     6507       +4     
============================================
+ Hits          41285    41401     +116     
- Misses         9289     9298       +9     
+ Partials       2752     2751       -1     
Flag Coverage Δ
ml-commons 77.45% <91.72%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 20:05 — with GitHub Actions Inactive
@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval April 28, 2026 20:05 — with GitHub Actions Inactive
Comment thread ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java Outdated
Comment thread ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java Outdated
Comment thread plugin/src/main/java/org/opensearch/ml/action/mcpserver/McpToolsHelper.java Outdated
Comment thread ml-algorithms/src/main/java/org/opensearch/ml/engine/tools/McpSseTool.java Outdated
Comment thread plugin/src/main/java/org/opensearch/ml/jobs/processors/MLStatsJobProcessor.java Outdated
Comment thread plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java Outdated
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 32ceee9

Adds OTel-backed metrics for MCP connectors (client-side) and the ml-commons
MCP server (server-side), following the pattern established in opensearch-project#4221.

New metrics:
- ml.commons.MCP_CONNECTOR_COUNT (adoption, tag: protocol)
- ml.commons.MCP_CONNECTOR_TOOL_INVOCATION_COUNT/_LATENCY (operational, tags: protocol, status)
- ml.commons.MCP_SERVER_REGISTERED_TOOL_COUNT (adoption, tag: tool_type)
- ml.commons.MCP_SERVER_TOOL_CALL_COUNT/_LATENCY (operational, tags: tool_type, status)
- ml.commons.MCP_SERVER_REQUEST_COUNT (operational, tags: method, status)

Adoption metrics are emitted by MLStatsJobProcessor via periodic index scans;
operational metrics are emitted at call sites. "Domains using MCP" is derived
upstream from the auto-injected cluster_name tag.

Moves AbstractMLMetricsCounter and MetricType from plugin to common so
ml-algorithms can reference the counter singletons.

Also removes the isMetricCollectionEnabled() gate around counter initialization
in MachineLearningPlugin. Counter emission is already gated inside the abstract
base, so initializing eagerly makes getInstance() safe at all call sites and
fixes a latent IllegalStateException when metric collection is disabled.

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
- Add testMetricsEmittedOnSuccessAndFailure to McpStreamableHttpToolTests
  (mirrors the SSE test; both tools have independent recordInvocation paths)
- Expand MachineLearningPlugin init comment with the full repro of the
  IllegalStateException bug that motivated eager counter initialization
- Add MLMcpMetricsCountersTests in common/src/test covering the MCP counter
  singleton scaffolding (JaCoCo reports per-module; call-site tests in other
  modules don't cover the new counter classes in common)
- Drop redundant tearDown reset in MLStatsJobProcessorTests (setUp reset is
  sufficient and matches the pattern used by MLAdoptionMetricsCounterTests)

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
- Histogram tags: pass the same protocol+status / tool_type+status Tags object
  to recordHistogram so latency can be sliced by success vs failure
  (McpSseTool, McpStreamableHttpTool, McpToolsHelper).
- TransportMcpServerAction.extractMethod: collapse anything outside the
  server-served allowlist (initialize, notifications/initialized, ping,
  tools/list, tools/call, logging/setLevel) to "other" so untrusted client
  input cannot inflate the method-tag cardinality.
- MLStatsJobProcessor: log a warning when a metrics scan returns more hits
  than BATCH_SIZE so operators notice silent under-counting until pagination
  is added. Drop the unreachable "unknown" protocol fallback in the connector
  scan (search query already filters on protocol).
- MachineLearningPlugin: hoist clusterName local across the four counter
  initialize() calls.
- Tests: assert the histogram receives the status tag in the SSE/StreamableHttp
  tool tests; add a TransportMcpServerAction test proving an off-allowlist
  method is tagged as "other".

Signed-off-by: rithinpullela <rithinp@amazon.com>
@github-actions
Copy link
Copy Markdown

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 3fed456.

PathLineSeverityDescription
common/src/main/java/org/opensearch/ml/stats/otel/counters/MLMcpConnectorMetricsCounter.java47lowPublic static reset() methods on production singleton classes (MLMcpConnectorMetricsCounter and MLMcpServerMetricsCounter) are documented as test-only but are accessible from any caller in the same JVM. If invoked in a running cluster, they null out the singleton and cause subsequent getInstance() calls to throw IllegalStateException, disrupting metrics collection. The pattern is consistent with existing counters in the codebase and poses no direct data exfiltration risk, but the lack of access restriction is an anomaly worth noting.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 3fed456

Signed-off-by: rithinpullela <rithinp@amazon.com>
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit cc3d6dd

Extract recordInvocation duplication between McpSseTool and McpStreamableHttpTool
into a package-private McpToolMetrics helper. Strengthen MLMcpMetricsCountersTests
to mirror the MLAdoptionMetricsCounterTests / MLOperationalMetricsCounterTests
pattern (registry creation counts, increment/record routing, feature-flag toggle).

Signed-off-by: rithinpullela <rithinp@amazon.com>
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 625b272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants