feat: expose metrics for MCP calls #38
Conversation
Walkthrough

Adds Prometheus metrics for MCP tool calls (a counter and a histogram), registers a /metrics HTTP endpoint, injects a metrics middleware into the MCP server request path, updates go.mod for the Prometheus and k8s dependencies, extends the e2e tests to validate metrics, and adjusts the build/task files and compose/README examples.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant MCPServer as MCP Server
    participant MetricsMW as Metrics Middleware
    participant Prometheus as Prometheus Collector
    participant HTTP as /metrics Endpoint
    Client->>MCPServer: Send MCP tool call
    MCPServer->>MetricsMW: Invoke middleware (method, params)
    MetricsMW->>MetricsMW: start timer, inspect params
    MetricsMW->>MCPServer: Call next handler
    MCPServer-->>MetricsMW: Return result, error
    MetricsMW->>Prometheus: Increment MCPCallsTotal (method, tool, success)
    MetricsMW->>Prometheus: Observe MCPCallDuration (method, tool, success)
    MetricsMW-->>Client: Return result, error
    Client->>HTTP: GET /metrics
    HTTP->>Prometheus: Gather metrics
    Prometheus-->>HTTP: Return exposition
    HTTP-->>Client: Metrics payload
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
draft until codeready-toolchain/toolchain-e2e#1243 and #37 have been reviewed and merged
the `mcp_calls_total` and `mcp_call_duration_seconds` metrics are collected via the SDK's `mcp.Middleware` and exposed on the `/metrics` endpoint of the server

Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
(force-pushed 5fe0cdb to e845c33)
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @internal/metrics/metrics.go:
- Line 10: The comment above the metric is incorrect: it references
"ToolCallsTotal" but the variable is named "MCPCallsTotal"; update the comment
to accurately describe MCPCallsTotal (e.g., "MCPCallsTotal counts total MCP
invocations by method/name") so the comment and the variable MCPCallsTotal in
internal/metrics/metrics.go match.
- Around line 19-26: The MCPCallDuration histogram uses seconds buckets but the
middleware is observing duration.Milliseconds(), causing wrong bucketing; in the
middleware function that records MCPCallDuration (the code that computes a
duration variable and calls MCPCallDuration.WithLabelValues(...).Observe(...)),
replace the use of duration.Milliseconds() with duration.Seconds() (no cast
needed) so the observed value is in seconds and aligns with
prometheus.DefBuckets.
In @internal/server/middleware.go:
- Line 34: The metric uses MCPCallDuration (named mcp_call_duration_seconds) but
records duration.Milliseconds(), causing a 1000x mismatch; update the
observation to use duration.Seconds(), which already returns a float64, so no
cast is needed (i.e., replace duration.Milliseconds() with duration.Seconds())
in the metrics.MCPCallDuration.WithLabelValues(method, tool,
strconv.FormatBool(success)).Observe(...) call so the recorded unit matches
seconds.
In @test/e2e/server_test.go:
- Line 249: The require.NoError call can panic because cmd.ProcessState may be
nil; modify the failure message around require.NoError(t, err, ...) to first
check cmd.ProcessState != nil and include cmd.ProcessState.ExitCode() only when
non-nil (e.g., compute a local variable like exitInfo := "<no process state>" or
an int default and conditionally format the message), otherwise omit or
substitute a safe placeholder; update the call that currently references
cmd.ProcessState.ExitCode() so it uses the guarded value and prevents nil
dereference.
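The last item above can be sketched as follows. This is a hedged, dependency-free illustration: `exitInfo` is a hypothetical helper name, not taken from the PR.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// exitInfo returns a printable exit description, guarding against a nil
// ProcessState (e.g. when the process never started or has not yet exited).
// The helper name is illustrative, not from the PR.
func exitInfo(ps *os.ProcessState) string {
	if ps == nil {
		return "<no process state>"
	}
	return fmt.Sprintf("exit code %d", ps.ExitCode())
}

func main() {
	cmd := exec.Command("true")
	err := cmd.Run()
	// Safe even if cmd.ProcessState is nil:
	fmt.Printf("command finished: err=%v, %s\n", err, exitInfo(cmd.ProcessState))
	fmt.Println(exitInfo(nil)) // "<no process state>"
}
```

The failure message passed to `require.NoError` would then use `exitInfo(cmd.ProcessState)` instead of calling `cmd.ProcessState.ExitCode()` directly.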
🧹 Nitpick comments (2)
go.mod (1)
12-12: Version v1.22.0 is stable and secure.

`prometheus/client_golang` v1.22.0 is a stable release (April 2025) with no known vulnerabilities; the major CVE-2022-21698 was fixed in v1.11.1. Consider upgrading to v1.23.2 (latest stable, September 2025) at a future maintenance window if alignment with the latest release is desired.

internal/server/middleware.go (1)
23-26: Consider using a default value for non-tool requests.

When the request is not a `CallToolParams`, the `tool` variable remains an empty string. While Prometheus accepts empty label values, using a default like `"N/A"` or `"none"` would improve metric queryability and clarity.

♻️ Proposed refactor

```diff
-	var tool string
+	tool := "N/A"
 	if p, ok := req.GetParams().(*mcp.CallToolParamsRaw); ok {
 		tool = p.Name
 	}
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (7)
- README.md
- cmd/start_server.go
- go.mod
- internal/metrics/metrics.go
- internal/server/middleware.go
- internal/server/server.go
- test/e2e/server_test.go
🧰 Additional context used
🧬 Code graph analysis (2)
internal/server/middleware.go (1)
internal/metrics/metrics.go (2)
MCPCallsTotal (11-17), MCPCallDuration (19-26)
internal/server/server.go (1)
internal/server/middleware.go (1)
NewMetricsMiddleware (13-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Lint
- GitHub Check: Test
🔇 Additional comments (11)
README.md (1)
117-117: LGTM! Necessary for container accessibility.

Setting `ARGOCD_MCP_SERVER_LISTEN_HOST=0.0.0.0` ensures the server binds to all network interfaces, making it accessible from outside the container. This aligns well with the port mapping configuration.

internal/server/server.go (1)

27-27: LGTM! Metrics middleware integration is well-placed.

The metrics middleware is correctly positioned to capture all incoming MCP requests. Placing it after prompts but before tools ensures comprehensive coverage of both tool invocations and prompt handling.
cmd/start_server.go (2)
62-62: LGTM! Log key renamed for clarity.

Renaming the log key from "url" to "argocd-url" improves clarity and consistency with the flag name.
86-87: Consider access control for the metrics endpoint.

The `/metrics` endpoint exposes operational metrics without authentication. While this is standard for internal Kubernetes deployments where metrics are scraped within a trusted network, verify this aligns with your security requirements. If the server is exposed to untrusted networks, consider adding authentication or restricting access to the metrics endpoint.
go.mod (1)
16-16: The k8s.io/client-go version override is intentional—no conflict exists.

Line 16 declares `k8s.io/client-go v0.33.0`, and line 182 replaces it with `v0.32.2`. This is standard Go practice using replace directives. The effective version is unambiguous: `v0.33.0 => v0.32.2`. All k8s.io packages are consistently pinned to v0.32.2, suggesting a deliberate compatibility strategy. If the reasoning for this downgrade isn't already documented, consider adding a comment in go.mod explaining why v0.33.0 is required but v0.32.2 is used.

test/e2e/server_test.go (4)
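For reference, the declare-then-replace pattern described above looks like this in a go.mod file. This is a hedged sketch: the comment text is hypothetical, and the PR's actual go.mod may word things differently.

```gomod
require k8s.io/client-go v0.33.0

// Hypothetical rationale comment: pin to v0.32.2 for compatibility with the
// other k8s.io modules, which are all at v0.32.2.
replace k8s.io/client-go => k8s.io/client-go v0.32.2
```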
15-16: LGTM!

The new imports for metrics verification (`toolchaintests` and `rest`) are correctly added and align with the metrics testing requirements.

Also applies to: 23-23

175-180: LGTM!

Metrics verification is consistently applied across both success and error scenarios. The pattern correctly validates that the `mcp_calls_total` metric is incremented with appropriate label values for each test case.

Also applies to: 195-200

210-210: LGTM!

The test case renames to include "-unreachable" improve clarity, and the change from `t.Errorf` to `t.Logf` on line 267 appropriately handles expected exit errors during teardown.

Also applies to: 214-214, 267-267

106-112: Verification of toolchaintests.GetMetricValue API usage is inconclusive.

The code consistently uses the flattened array format `[]string{"method", "tools/call", "name", "unhealthyApplications", "success", "true"}` across all metric queries in the test file. However, the actual `GetMetricValue` implementation in the external `github.com/codeready-toolchain/toolchain-e2e/testsupport/metrics` package could not be accessed to confirm the expected label format against the API specification. Verify this against the toolchain-e2e library's actual API documentation to ensure compliance.

internal/server/middleware.go (2)
13-22: LGTM!

The middleware structure follows the standard pattern correctly: it logs the request, records the start time, invokes the next handler, and measures the duration.

27-31: LGTM!

The success determination logic correctly accounts for both Go errors (`err == nil`) and MCP protocol errors (`!r.IsError`), ensuring accurate metrics labeling for both transport-level and business-logic failures.
rsoaresd
left a comment
Great work 🚀
Just a minor question, for learning purposes
```go
// also, check the metrics when the server runs on HTTP
if td.name == "http" {
	// get the metrics
	metric, err := toolchaintests.GetMetricValue(&rest.Config{}, "http://"+MCPServerListen, `mcp_calls_total`, []string{"method", "tools/call", "name", "unhealthyApplications", "success", "true"})
```
is there a reason why we are not checking `mcp_call_duration_seconds`? Sorry if it is a noob question, just want to understand
good point, let me add a check for that metric ;)
alexeykazakov
left a comment
Looks good but take a look at the CodeRabbitAI comments :)
Any plans for adding metrics to the sandbox MCP servers too?
```go
var (
	// ToolCallsTotal counts total tool invocation by tool name
```
```diff
-	// ToolCallsTotal counts total tool invocation by tool name
+	// MCPCallsTotal counts total tool invocation by tool name
```
I will, definitely ;)
well, @mfrancisc has already done a round of metrics here, and I will port my work here into https://github.com/codeready-toolchain/devsandbox-observability-mcp-server once this PR is merged
```go
// Metrics endpoint
mux.Handle("/metrics", promhttp.Handler())
```
I think we usually expose the metrics using its own server and a different port, but I guess this is fine for the MCP server since it is not exposed publicly.
I've opened a discussion on Slack for this specific question
```go
MCPCallsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mcp_calls_total",
		Help: "Total number of MCP calls",
	},
	[]string{"method", "name", "success"},
)
```
I think we should have separate metrics for tool calls and method calls. For example, as in https://github.com/codeready-toolchain/mcp-server-devsandbox/blob/master/pkg/metrics/metrics.go:

- the `mcp_method_calls_total` will count invocations of standard MCP protocol operations like `initialize`, `ping`, `tools/list`
- the `mcp_tool_calls_total` will count calls per tool
Also, it would be nice to have `mcp_tool_call_errors_total` counting errors per tool, so that we can have insights into tools that are failing (maybe because the LLM can't figure out how to call them properly, or for any other reason).
but since we have labels on `mcp_calls_total`, we can filter by method (`initialize`, `tools/call`, etc.), then for tools by name (`unhealthyApplications`, `unhealthyApplicationResources`, etc.), and also by success (true, false).
I mean, I don't see the benefits of having multiple metrics.
Also, that's what is usually done with HTTP metrics: you have labels for the method (get, post) and the return code (200, 404, 500, etc.)
I thought that having separate metrics for tools and methods would help with creating separate dashboards, since I guess we would mostly be interested in looking at the tools charts (more than the methods ones), but maybe I'm wrong.
I believe we can have a single metric and use the labels to focus on tool calls per name, etc.
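The single-metric-plus-labels argument above can be made concrete with a few queries. These are hypothetical PromQL sketches, assuming the `method`/`name`/`success` labels from this PR:

```promql
# all tool calls, per tool
sum by (name) (rate(mcp_calls_total{method="tools/call"}[5m]))

# errors per tool (covers the mcp_tool_call_errors_total use case via a label filter)
sum by (name) (rate(mcp_calls_total{method="tools/call",success="false"}[5m]))

# protocol-level operations only (initialize, ping, tools/list, ...)
sum by (method) (rate(mcp_calls_total{method!="tools/call"}[5m]))
```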
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal/metrics/metrics.go (2)
19-27: Consider custom histogram buckets (DefBuckets may not fit the MCP latency profile).

`prometheus.DefBuckets` can be too coarse/short for RPC-like calls (or too coarse at the low end). If you have expected SLOs (e.g., p95 ~ 200ms, but occasional 30s), tune the buckets accordingly so the histogram is actually useful.

Possible bucket tuning (example)

```diff
 MCPCallDuration = promauto.NewHistogramVec(
 	prometheus.HistogramOpts{
 		Name: "mcp_call_duration_seconds",
 		Help: "Duration of MCP calls in seconds",
-		Buckets: prometheus.DefBuckets,
+		// Adjust based on expected MCP latency distribution / SLOs.
+		Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30},
 	},
 	[]string{"method", "name", "success"},
 )
```
1-6: The current implementation is sound and avoids the risks mentioned.

The metric names (`mcp_calls_total` and `mcp_call_duration_seconds`) are registered only once in the codebase, eliminating duplicate registration concerns. Label cardinality is well-controlled: the `name` label is populated by `tool` (bounded to the available tools), not unbounded user input, and `success` is a boolean. If embedding or custom testing scenarios require a custom `prometheus.Registerer` in the future, it can be refactored then—the current default-registry approach is appropriate for this service.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/metrics/metrics.go
🔇 Additional comments (1)
internal/metrics/metrics.go (1)
8-17: The `name` label cardinality is bounded and not a concern.

The `name` label only takes values from the two registered tools (`"unhealthyApplications"` and `"unhealthyApplicationResources"`) or an empty string for non-`tools/call` methods. There is no risk of unbounded cardinality or memory explosion. While using a sentinel value like `"__none__"` instead of an empty string would be slightly clearer for code readability, this is not a functional issue.

Likely an incorrect or invalid review comment.
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
mfrancisc
left a comment
Looks good 👍
Thanks for answering my questions and for the discussion !
rajivnathan
left a comment
Assuming you still plan to add the test for `mcp_call_duration_seconds`?
yes, it's still something I plan to add, indeed! I've switched to #40 for now but I'll come back to this one right after that ;)
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@test/e2e/server_test.go`:
- Around line 266-268: The test helper valueInLastBucket should guard against an
empty buckets slice and avoid unsafe uint64->int conversion; update
valueInLastBucket to return 0 (or an error) when len(buckets) == 0, read the
uint64 value from buckets[len(buckets)-1].CumulativeCount into a uint64
variable, then clamp it to the runtime int range (compare against math.MaxInt)
before converting to int so you never overflow on 32-bit systems; reference the
valueInLastBucket function and buckets[..].CumulativeCount when making this
change.
🧹 Nitpick comments (1)
test/e2e/server_test.go (1)
253-264: Consider parameterizing `mcpServerURL` or making it a constant.

The static analysis tool flags that `mcpServerURL` always receives `"http://localhost:50081"`. While this is minor for test code, you could either extract it as a constant or accept the lint warning with a `//nolint:unparam` directive if this is intentional.

♻️ Option: Extract as a constant

```diff
+const mcpServerMetricsURL = "http://localhost:50081"
+
-func getMetrics(t *testing.T, mcpServerURL string, labels map[string]string) (int, []*io_prometheus_client.Bucket) {
+func getMetrics(t *testing.T, labels map[string]string) (int, []*io_prometheus_client.Bucket) {
 	labelStrings := make([]string, 0, 2*len(labels))
 	for k, v := range labels {
 		labelStrings = append(labelStrings, k)
 		labelStrings = append(labelStrings, v)
 	}
-	mcpCallsTotalMetric, err := toolchaintests.GetMetricValue(&rest.Config{}, mcpServerURL, `mcp_calls_total`, labelStrings)
+	mcpCallsTotalMetric, err := toolchaintests.GetMetricValue(&rest.Config{}, mcpServerMetricsURL, `mcp_calls_total`, labelStrings)
 	require.NoError(t, err)
-	mcpCallsDurationSecondsBuckets, err := toolchaintests.GetHistogramBuckets(&rest.Config{}, mcpServerURL, `mcp_call_duration_seconds`, labelStrings)
+	mcpCallsDurationSecondsBuckets, err := toolchaintests.GetHistogramBuckets(&rest.Config{}, mcpServerMetricsURL, `mcp_call_duration_seconds`, labelStrings)
 	require.NoError(t, err)
 	return int(mcpCallsTotalMetric), mcpCallsDurationSecondsBuckets
 }
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (6)
- .github/workflows/cd.yml
- README.md
- cmd/start_server.go
- go.mod
- taskfile.yaml
- test/e2e/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- README.md
🧰 Additional context used
🪛 GitHub Actions: ci
test/e2e/server_test.go
[error] 267-267: Gosec: G115: integer overflow conversion uint64 -> int. Potential overflow when converting uint64 to int in test/e2e/server_test.go:267.
🪛 GitHub Check: Lint
test/e2e/server_test.go
[failure] 253-253:
getMetrics - mcpServerURL always receives "http://localhost:50081" (unparam)
[failure] 267-267:
G115: integer overflow conversion uint64 -> int (gosec)
🔇 Additional comments (8)
.github/workflows/cd.yml (1)
24-24: LGTM!

The containerfile reference is correctly updated from `Containerfile.yaml` to `Containerfile`, consistent with the corresponding change in `taskfile.yaml`.

taskfile.yaml (2)
17-27: LGTM!

Good modernization: updating from `podman-compose` to `podman compose` (Compose v2 syntax) and shifting the build strategy to container-based builds with `podman compose build`.
39-39: LGTM!

The Containerfile reference is correctly updated to match `.github/workflows/cd.yml`.

cmd/start_server.go (2)
15-15: LGTM!

Correct import for exposing Prometheus metrics via HTTP.

86-87: LGTM!

Standard approach for exposing Prometheus metrics. The `/metrics` endpoint correctly uses `promhttp.Handler()` from the Prometheus client library.

test/e2e/server_test.go (2)
10-19: LGTM!

Imports are correctly added for Prometheus metrics testing utilities.

49-97: LGTM!

Good test pattern: capturing metrics before the call, executing the tool call, then verifying both that `mcp_calls_total` incremented by 1 and that the `mcp_call_duration_seconds` histogram recorded the call. The HTTP-transport guard is appropriate since stdio doesn't expose metrics.

go.mod (1)
8-17: LGTM!

Dependencies are correctly added to support the Prometheus metrics feature:

- `toolchain-e2e` for metrics test utilities
- `prometheus/client_golang` for metrics instrumentation
- `k8s.io/client-go`, required by the test utilities
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
… use `docker-compose` Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
Signed-off-by: Xavier Coulon <xcoulon@redhat.com>
the `mcp_calls_total` and `mcp_call_duration_seconds` metrics are collected via the SDK's `mcp.Middleware` and exposed on the `/metrics` endpoint of the server

Summary by CodeRabbit
New Features
Updates
Tests
Chores