Commit c384fd7

Merge pull request #109810 from ShaunaDiaz/OSDOCS-18942
OSDOCS-18942: adds observability to MCP gateway
2 parents f19c445 + 8ac5ff0 commit c384fd7

4 files changed

Lines changed: 280 additions & 2 deletions

_attributes/attributes.adoc

Lines changed: 4 additions & 1 deletion
@@ -20,7 +20,9 @@
2020
:ocp-min-version: 4.19
2121
:oc-first: pass:quotes[OpenShift CLI (`oc`)]
2222

23-
:ossm: OpenShift Service Mesh
23+
:OTELName: Red{nbsp}Hat build of OpenTelemetry
24+
25+
:service-mesh: OpenShift Service Mesh
2426
:service-mesh-version: 3.2
2527

2628
:cert-manager: cert-manager Operator for Red Hat OpenShift
@@ -32,3 +34,4 @@
3234
:TempoShortName: Distributed Tracing Platform
3335
:TempoOperator: Tempo Operator
3436
:TempoVersion: 2.3.1
37+
:DTShortName: distributed tracing

modules/proc-mcp-gateway-otel.adoc

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
1+
// Module included in the following assemblies:
2+
//
3+
// *mcp_gateway_config/mcp-gateway-observe.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="proc-mcp-gateway-otel_{context}"]
7+
= Enabling {OTELName} for the {mcpg}
8+
9+
[role="_abstract"]
10+
The MCP gateway uses {OTELName} throughout all components to give you consistent logging, {DTShortName}, and metrics. By using observability, you can achieve the following goals:
11+
12+
* Discover which component generated an error, such as Envoy, an MCP gateway component, or a backend MCP server.
13+
* Understand how requests flow through the system components.
14+
* See which tools were called and which MCP servers executed those tools.
15+
* Understand where to find relevant logs for each component.
16+
* Examine metrics about tool call patterns, success rates, and error rates.
17+
* Examine {DTShortName} across the request path.
18+
19+
.Prerequisites
20+
21+
* You installed {mcpg}.
22+
* You installed {prodname}.
23+
* You configured a `Gateway` object.
24+
* You are logged in to a running {ocp} cluster with an `admin` role.
25+
* You installed {OTELName}.
26+
* You configured the {TempoName} for traces.
27+
* You installed the OpenShift Logging Operator (Loki).
28+
* You configured an S3-compatible persistent storage volume for Loki to store long-term MCP tool call logs.
29+
30+
.Procedure
31+
32+
. Use the following example YAML to create a collector in your MCP gateway deployment namespace:
33+
+
34+
.Example OpenTelemetryCollector custom resource (CR)
35+
[source,yaml,subs="+quotes"]
36+
----
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: _<mcp_otel_collector>_
  namespace: _<mcp_gateway_system>_
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      otlp/tempo:
        endpoint: tempo-gateway-http.openshift-tracing.svc.cluster.local:4317
        tls:
          insecure: true
      debug:
        verbosity: basic
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo, debug]
        logs:
          receivers: [otlp]
          exporters: [debug]
67+
----
68+
+
69+
* Replace the value of `metadata.name:` with the name you want to assign to this collector.
70+
* Replace the value of `metadata.namespace:` with the namespace of your MCP gateway deployment.
71+
* The `spec.config.exporters:` field value points to {DTShortName}.
72+
* The `spec.config.exporters.otlp/tempo.tls.insecure: true` setting is for internal cluster communication without TLS.
73+
74+
. Set the following environment variables on the `mcp-gateway` deployment in your {ocp} cluster by running the following command:
75+
+
76+
[source,terminal]
77+
----
$ oc set env deployment/mcp-gateway \
  OTEL_EXPORTER_OTLP_ENDPOINT="http://mcp-otel-collector-collector.mcp-gateway-system.svc.cluster.local:4318" \
  OTEL_EXPORTER_OTLP_INSECURE="true" \
  OTEL_SERVICE_NAME="mcp-gateway"
82+
----
83+
+
84+
When {OTELName} creates a collector, the service name is `[metadata.name]-collector`.
85+
86+
. Optional: To send traces to one collector and logs to a different collector, set the following additional environment variables:
87+
+
88+
[source,terminal]
89+
----
$ oc set env deployment/mcp-gateway \
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://trace-collector.tracing-namespace.svc:4317" \
  OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://log-collector.logging-namespace.svc:4317"
93+
----
94+
+
95+
* Set `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` to override the endpoint for traces.
96+
* Set `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` to override the endpoint for logs.
97+
98+
. If you are using {TempoName}, select your stack name as the data source in the OpenShift Web Console *Observe > Traces* dashboard. For example, `tempo-mcp`.
99+
100+
. After you select the stack name, inspect the span attributes in the OpenShift Web Console *Observe > Traces* dashboard to find where errors occurred or to gather information for identifying trends.
101+
102+
* You can also filter by `service.name="mcp-gateway"`.
103+
104+
. When using Loki, use the *Observe > Logs* view and toggle the *Structured* view to filter by `mcp.session.id`.
105+
106+
. When using Loki, in the *Trace timeline* view, look for the `Log` icon next to `spans` to jump directly to the relevant logs.
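
The endpoint values in the preceding steps follow two rules: the collector service is named `<metadata.name>-collector`, and a signal-specific variable overrides the shared `OTEL_EXPORTER_OTLP_ENDPOINT` value. The following Python sketch models both rules. The function names `collector_base_url` and `resolve_endpoint` are hypothetical and are not part of the MCP gateway or of any OpenTelemetry SDK. The sketch shows only the naming and the precedence, not how an exporter processes the values.

```python
from typing import Mapping, Optional


def collector_base_url(collector_name: str, namespace: str, port: int = 4318) -> str:
    """Build the in-cluster OTLP/HTTP URL for a collector.

    Assumes the `<metadata.name>-collector` service naming described above.
    """
    return f"http://{collector_name}-collector.{namespace}.svc.cluster.local:{port}"


def resolve_endpoint(signal: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the endpoint that a signal ("traces" or "logs") would use.

    A signal-specific variable, if set, wins over the shared variable.
    """
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT")


if __name__ == "__main__":
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_base_url(
            "mcp-otel-collector", "mcp-gateway-system"
        ),
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://trace-collector.tracing-namespace.svc:4317",
    }
    for signal in ("traces", "logs"):
        print(signal, "->", resolve_endpoint(signal, env))
```

In this example, traces use the dedicated trace collector, and logs fall back to the shared endpoint because no logs-specific variable is set.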
107+
108+
.Troubleshooting
109+
110+
* If you do not see any traces, check if the `NetworkPolicy` CR in your namespace allows traffic from the MCP gateway to the {OTELName} collector service.
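
One way to narrow down a network policy problem is to test whether a TCP connection to the collector service succeeds from a pod in the MCP gateway namespace. The following Python sketch is a minimal reachability check, assuming that Python is available in the pod where you run it. The `can_connect` function is a hypothetical helper, and the host name in the example reuses the collector service name from the procedure. A successful connection shows only that the port is reachable, not that the collector accepts the telemetry.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "mcp-otel-collector-collector.mcp-gateway-system.svc.cluster.local"
    for port in (4317, 4318):
        state = "reachable" if can_connect(host, port) else "not reachable"
        print(f"{host}:{port} is {state}")
```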

modules/ref-mcp-gateway-router-spans.adoc

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
1+
// Module included in the following assemblies:
2+
//
3+
// *mcp_gateway_config/mcp-gateway-observe.adoc
4+
5+
:_mod-docs-content-type: REFERENCE
6+
[id="ref-mcp-gateway-router-spans_{context}"]
7+
= {mcpg} router spans for observability
8+
9+
[role="_abstract"]
10+
When enabled, the MCP router, `ext_proc`, emits trace spans for every request and can export structured logs through {OTELName}.
11+
12+
With traces, you can see one continuous timeline of traffic events across different pods and services. This can help you identify bottlenecks and conduct root-cause analysis.
13+
14+
Spans show one part of the journey. The attributes attached to those spans give you the contextual metadata that you need to have searchable and meaningful traces. The attributes are simple `key: value` pairs attached to each span.
15+
16+
The MCP gateway uses the following router spans:
17+
18+
.MCP router spans
19+
[cols="1,1,2", options="header"]
20+
|===
21+
|Span |When |Description
22+
23+
|`mcp-router.process`
24+
|Every `ext_proc` stream
25+
|Root span. Starts when request headers arrive, ends after response headers are processed.
26+
27+
|`mcp-router.route-decision`
28+
|The request body is parsed
29+
|Routing decision: `tool-call` versus pass-through to broker.
30+
31+
|`mcp-router.broker-passthrough`
32+
|Non-`tool-call` requests
33+
|Pass-through to broker: `initialize`, `tools/list`, `notifications`. No child spans means the broker is not instrumented.
34+
35+
|`mcp-router.tool-call`
36+
|`tools/call` requests
37+
|Full tool call handling including session and server resolution.
38+
39+
|`mcp-router.broker.get-server-info`
40+
|Inside `tool-call`
41+
|Call out to broker to resolve which backend server owns the tool.
42+
43+
|`mcp-router.session-cache.get`
44+
|Inside `tool-call`
45+
|Call out to session cache to look up an existing backend session.
46+
47+
|`mcp-router.session-init`
48+
|Cache miss during `tool-call`
49+
|Hairpin `initialize` request through the gateway to the backend MCP server.
50+
51+
|`mcp-router.session-cache.store`
52+
|After `session-init`
53+
|Call out to session cache to store the new backend session.
54+
|===
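
To make the "When" column of the preceding table concrete, the following Python sketch lists the span names you might expect in a trace for a given request. The `expected_spans` function is a hypothetical helper, not part of the MCP gateway. It is based only on the table, it ignores the `elicitation-response` route, and the order of the returned names follows the table rather than the actual emission order.

```python
from typing import List


def expected_spans(method: str, session_cached: bool = True) -> List[str]:
    """Return the router span names expected for one JSON-RPC method."""
    # Every ext_proc stream has a root span and a routing decision.
    spans = ["mcp-router.process", "mcp-router.route-decision"]

    if method != "tools/call":
        # initialize, tools/list, notifications, and similar requests.
        spans.append("mcp-router.broker-passthrough")
        return spans

    spans += [
        "mcp-router.tool-call",
        "mcp-router.broker.get-server-info",
        "mcp-router.session-cache.get",
    ]
    if not session_cached:
        # A cache miss triggers a hairpin initialize, then a cache store.
        spans += ["mcp-router.session-init", "mcp-router.session-cache.store"]
    return spans


if __name__ == "__main__":
    print(expected_spans("tools/list"))
    print(expected_spans("tools/call", session_cached=False))
```

Comparing this list with the spans in an actual trace can help you spot a missing step, such as a tool call that never reached the session lookup.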
55+
56+
The attributes in the following tables use link:https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/#server[OpenTelemetry MCP Semantic Conventions].
57+
58+
.MCP router root span, `mcp-router.process`, attributes
59+
[cols="1,1,2", options="header"]
60+
|===
61+
|Attribute |Source |Description
62+
63+
|`http.method`
64+
|`:method` header
65+
|HTTP method, `POST`
66+
67+
|`http.path`
68+
|`:path` header
69+
|Request path, `/mcp`
70+
71+
|`http.request_id`
72+
|`x-request-id` header
73+
|Envoy request ID
74+
75+
|`mcp.method.name`
76+
|JSON-RPC `method` field
77+
|MCP method, `initialize`, `tools/call`, `tools/list`, and so on
78+
79+
|`gen_ai.tool.name`
80+
|JSON-RPC `params.name`
81+
|Tool name, only for `tools/call`
82+
83+
|`jsonrpc.request.id`
84+
|JSON-RPC `id` field
85+
|JSON-RPC request ID
86+
87+
|`jsonrpc.protocol.version`
88+
|JSON-RPC `jsonrpc` field
89+
|Always "2.0"
90+
91+
|`gen_ai.operation.name`
92+
|JSON-RPC `method` field
93+
|Same as `mcp.method.name`
94+
95+
|`mcp.session.id`
96+
|`mcp-session-id` header
97+
|Gateway session ID
98+
99+
|`client.address`
100+
|`x-forwarded-for` header
101+
|Client IP address
102+
103+
|`http.status_code`
104+
|`:status` response header
105+
|Response status code
106+
|===
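
The following Python sketch shows how the sources in the preceding table map to attributes for a single request. It is an illustration only: the `root_span_attributes` function is hypothetical, the router itself is not implemented this way as far as this document states, and the header and body values are made up. The `http.status_code` attribute is omitted because it comes from the response.

```python
import json
from typing import Any, Dict, Mapping


def root_span_attributes(headers: Mapping[str, str], body: bytes) -> Dict[str, Any]:
    """Derive mcp-router.process attributes from request headers and a JSON-RPC body."""
    rpc = json.loads(body)
    method = rpc.get("method")

    attrs = {
        "http.method": headers.get(":method"),
        "http.path": headers.get(":path"),
        "http.request_id": headers.get("x-request-id"),
        "mcp.method.name": method,
        "gen_ai.operation.name": method,  # same value as mcp.method.name
        "jsonrpc.request.id": rpc.get("id"),
        "jsonrpc.protocol.version": rpc.get("jsonrpc"),
        "mcp.session.id": headers.get("mcp-session-id"),
        "client.address": headers.get("x-forwarded-for"),
    }
    if method == "tools/call":
        # The tool name is recorded only for tool calls.
        attrs["gen_ai.tool.name"] = (rpc.get("params") or {}).get("name")

    # Leave out attributes whose source is absent from the request.
    return {key: value for key, value in attrs.items() if value is not None}


if __name__ == "__main__":
    headers = {":method": "POST", ":path": "/mcp", "x-request-id": "req-1"}
    body = b'{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "example_tool"}}'
    for key, value in root_span_attributes(headers, body).items():
        print(f"{key} = {value}")
```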
107+
108+
.MCP route decision span, `mcp-router.route-decision`, attributes
109+
[cols="1,2", options="header"]
110+
|===
111+
|Attribute |Description
112+
113+
|`mcp.method.name`
114+
|MCP method
115+
116+
|`mcp.route`
117+
|Routing decision: `tool-call`, `broker`, or `elicitation-response`
118+
|===
119+
120+
.MCP tool call span, `mcp-router.tool-call`, attributes
121+
[cols="1,2", options="header"]
122+
|===
123+
|Attribute |Description
124+
125+
|`gen_ai.tool.name`
126+
|Tool name from the request
127+
128+
|`mcp.session.id`
129+
|Gateway session ID
130+
131+
|`mcp.server`
132+
|Resolved backend server name
133+
134+
|`mcp.server.hostname`
135+
|Resolved backend server hostname
136+
|===
137+
138+
.MCP error attributes
139+
[cols="1,2", options="header"]
140+
|===
141+
|Attribute |Description
142+
143+
|`error.type`
144+
|Error classification, such as `tool_not_found`, `missing_tool_name`, `invalid_session`, `session_cache_error`, `session_init_error`, `marshal_error`, `path_parse_error`
145+
146+
|`error_source`
147+
|Component that generated the error, such as `ext-proc`
148+
149+
|`http.status_code`
150+
|HTTP status code returned
151+
|===
152+
153+
When there is an error, spans include these attributes.
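
If you export spans for offline analysis, you can use the error attributes to count failures by component and type. The following Python sketch assumes that each span is a dictionary with an `attributes` mapping. That shape is an assumption for illustration, not a documented export format, and the `summarize_errors` function is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Tuple


def summarize_errors(spans: Iterable[Mapping[str, Any]]) -> "Counter[Tuple[str, str]]":
    """Count spans that carry error.type, grouped by (error_source, error.type)."""
    counts: Counter = Counter()
    for span in spans:
        attrs = span.get("attributes", {})
        error_type = attrs.get("error.type")
        if error_type is None:
            continue  # not an error span
        counts[(attrs.get("error_source", "unknown"), error_type)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"name": "mcp-router.tool-call",
         "attributes": {"error.type": "tool_not_found", "error_source": "ext-proc"}},
        {"name": "mcp-router.process", "attributes": {"http.method": "POST"}},
    ]
    for (source, error_type), count in summarize_errors(sample).items():
        print(f"{source}: {error_type} x{count}")
```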

observe_troubleshoot/mcp-gateway-observe.adoc

Lines changed: 13 additions & 1 deletion
@@ -5,4 +5,16 @@ include::_attributes/attributes.adoc[]
55
:context: mcp-gateway-observe
66

77
[role="_abstract"]
8-
FPO assembly
8+
You can use observability in {mcpg} to debug issues such as silent failures, track destructive annotations, measure latencies, and build an audit trail.
9+
10+
include::modules/proc-mcp-gateway-otel.adoc[leveloffset=+1]
11+
12+
include::modules/ref-mcp-gateway-router-spans.adoc[leveloffset=+2]
13+
14+
[id="additional-resources_mcp-gateway-observe"]
15+
[role="_additional-resources"]
16+
== Additional resources
17+
18+
* link:https://docs.redhat.com/en/documentation/red_hat_build_of_opentelemetry/3.9/html-single/configuring_the_collector/index[Chapter 1. Configuring the Collector (OpenTelemetry documentation)]
19+
20+
* link:https://modelcontextprotocol.io/docs/tools/inspector[MCP Inspector (Model Context Protocol developer tools documentation)]
