
Commit 33ad987

Support MCP observability for Envoy AI Gateway (#13791)
**MAL rules** (new files):
- `gateway-mcp-service.yaml` — 13 MCP service-level metrics (request CPM/latency/percentile, method CPM, error CPM, initialization latency, capabilities, per-backend breakdown)
- `gateway-mcp-instance.yaml` — 13 MCP instance-level metrics

**LAL rules** (modified `envoy-ai-gateway.yaml`):
- Split into two rules: `envoy-ai-gateway-llm-access-log` and `envoy-ai-gateway-mcp-access-log`
- LLM logs: persist error responses (>= 400) and upstream failures only
- MCP logs: persist error responses (>= 400) only
- Both rules tag `ai_route_type` (`llm` or `mcp`) for searchable filtering

**Dashboard** (modified service + instance JSON):
- Added **MCP** tab with 9 widgets (service) / 6 widgets (instance): request CPM, latency avg/percentile, error CPM, method CPM, initialization latency, backend breakdown

**E2E test** (modified):
- Added `mcp-server` service (`tzolov/mcp-everything-server:v3` — MCP reference server with StreamableHttp)
- Added MCP request steps (initialize + tools/list + tools/call)
- Added MCP metric verification cases
- Log query uses `ai_route_type=llm` tag filter

**Config**:
- Added `ai_route_type` to `searchableLogsTags` in `application.yml`
- Fixed aigw healthcheck binary path (`/app` instead of `aigw`)
1 parent 8907f0c commit 33ad987

File tree

15 files changed: +1100 −104 lines


docs/en/changes/changes.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@
 * Fix missing parentheses around OR conditions in `JDBCZipkinQueryDAO.getTraces()`, which caused the table filter to be bypassed for all but the first trace ID. Replaced with a proper `IN` clause.
 * Fix missing `and` keyword in `JDBCEBPFProfilingTaskDAO.getTaskRecord()` SQL query, which caused a syntax error on every invocation.
 * Fix duplicate `TABLE_COLUMN` condition in `JDBCMetadataQueryDAO.findEndpoint()`, which was binding the same parameter twice due to a copy-paste error.
+* Support MCP (Model Context Protocol) observability for Envoy AI Gateway: MCP metrics (request CPM/latency, method breakdown, backend breakdown, initialization latency, capabilities), MCP access log sampling (errors only), `ai_route_type` searchable log tag, and MCP dashboard tabs.
 
 #### UI
```

docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md

Lines changed: 76 additions & 36 deletions
```diff
@@ -4,8 +4,9 @@
 
 [Envoy AI Gateway](https://aigateway.envoyproxy.io/) is a gateway/proxy for AI/LLM API traffic
 (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, etc.) built on top of Envoy Proxy.
-It natively emits GenAI metrics and access logs via OTLP, following
-[OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/).
+It natively emits GenAI metrics following
+[OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/),
+and also emits MCP (Model Context Protocol) metrics and access logs via OTLP.
 
 SkyWalking receives OTLP metrics and logs directly on its gRPC port (11800) — no OpenTelemetry
 Collector is needed between the AI Gateway and SkyWalking OAP.
@@ -15,7 +16,7 @@ Collector is needed between the AI Gateway and SkyWalking OAP.
 [Envoy AI Gateway getting started](https://aigateway.envoyproxy.io/docs/getting-started/) for installation.
 
 ### Data flow
-1. Envoy AI Gateway processes LLM API requests and records GenAI metrics (token usage, latency, TTFT, TPOT).
+1. Envoy AI Gateway processes LLM API requests and MCP requests, recording GenAI metrics and MCP metrics.
 2. The AI Gateway pushes metrics and access logs via OTLP gRPC to SkyWalking OAP.
 3. SkyWalking OAP parses metrics with [MAL](../../concepts-and-designs/mal.md) rules and access logs
 with [LAL](../../concepts-and-designs/lal.md) rules.
@@ -27,14 +28,14 @@ in SkyWalking OAP. No OAP-side configuration is needed.
 
 Configure the AI Gateway to push OTLP to SkyWalking by setting these environment variables:
 
-| Env Var | Value | Purpose |
-|---------|-------|---------|
-| `OTEL_SERVICE_NAME` | Per-deployment gateway name (e.g., `my-ai-gateway`) | SkyWalking service name |
-| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://skywalking-oap:11800` | SkyWalking OAP gRPC receiver |
-| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc` | OTLP transport |
-| `OTEL_METRICS_EXPORTER` | `otlp` | Enable OTLP metrics push |
-| `OTEL_LOGS_EXPORTER` | `otlp` | Enable OTLP access log push |
-| `OTEL_RESOURCE_ATTRIBUTES` | See below | Routing + instance + layer |
+| Env Var                       | Value                                               | Purpose                      |
+|-------------------------------|-----------------------------------------------------|------------------------------|
+| `OTEL_SERVICE_NAME`           | Per-deployment gateway name (e.g., `my-ai-gateway`) | SkyWalking service name      |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://skywalking-oap:11800`                       | SkyWalking OAP gRPC receiver |
+| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc`                                              | OTLP transport               |
+| `OTEL_METRICS_EXPORTER`       | `otlp`                                              | Enable OTLP metrics push     |
+| `OTEL_LOGS_EXPORTER`          | `otlp`                                              | Enable OTLP access log push  |
+| `OTEL_RESOURCE_ATTRIBUTES`    | See below                                           | Routing + instance + layer   |
 
 **Required resource attributes** (in `OTEL_RESOURCE_ATTRIBUTES`):
 - `job_name=envoy-ai-gateway` — Fixed routing tag for MAL/LAL rules. Same for all AI Gateway deployments.
@@ -58,47 +59,86 @@ is a service, each pod is an instance. Metrics include per-provider and per-mode
 
 #### Service Metrics
 
-| Monitoring Panel | Unit | Metric Name | Description |
-|---|---|---|---|
-| Request CPM | calls/min | meter_envoy_ai_gw_request_cpm | Requests per minute |
-| Request Latency Avg | ms | meter_envoy_ai_gw_request_latency_avg | Average request duration |
-| Request Latency Percentile | ms | meter_envoy_ai_gw_request_latency_percentile | P50/P75/P90/P95/P99 |
-| Input Token Rate | tokens/min | meter_envoy_ai_gw_input_token_rate | Input (prompt) tokens per minute |
-| Output Token Rate | tokens/min | meter_envoy_ai_gw_output_token_rate | Output (completion) tokens per minute |
-| TTFT Avg | ms | meter_envoy_ai_gw_ttft_avg | Time to First Token (streaming only) |
-| TTFT Percentile | ms | meter_envoy_ai_gw_ttft_percentile | P50/P75/P90/P95/P99 TTFT |
-| TPOT Avg | ms | meter_envoy_ai_gw_tpot_avg | Time Per Output Token (streaming only) |
-| TPOT Percentile | ms | meter_envoy_ai_gw_tpot_percentile | P50/P75/P90/P95/P99 TPOT |
+| Monitoring Panel           | Unit       | Metric Name                                  | Description                            |
+|----------------------------|------------|----------------------------------------------|----------------------------------------|
+| Request CPM                | calls/min  | meter_envoy_ai_gw_request_cpm                | Requests per minute                    |
+| Request Latency Avg        | ms         | meter_envoy_ai_gw_request_latency_avg        | Average request duration               |
+| Request Latency Percentile | ms         | meter_envoy_ai_gw_request_latency_percentile | P50/P75/P90/P95/P99                    |
+| Input Token Rate           | tokens/min | meter_envoy_ai_gw_input_token_rate           | Input (prompt) tokens per minute       |
+| Output Token Rate          | tokens/min | meter_envoy_ai_gw_output_token_rate          | Output (completion) tokens per minute  |
+| TTFT Avg                   | ms         | meter_envoy_ai_gw_ttft_avg                   | Time to First Token (streaming only)   |
+| TTFT Percentile            | ms         | meter_envoy_ai_gw_ttft_percentile            | P50/P75/P90/P95/P99 TTFT               |
+| TPOT Avg                   | ms         | meter_envoy_ai_gw_tpot_avg                   | Time Per Output Token (streaming only) |
+| TPOT Percentile            | ms         | meter_envoy_ai_gw_tpot_percentile            | P50/P75/P90/P95/P99 TPOT               |
 
 #### Provider Breakdown Metrics
 
-| Monitoring Panel | Unit | Metric Name | Description |
-|---|---|---|---|
-| Provider Request CPM | calls/min | meter_envoy_ai_gw_provider_request_cpm | Requests by provider |
-| Provider Token Rate | tokens/min | meter_envoy_ai_gw_provider_token_rate | Token rate by provider |
-| Provider Latency Avg | ms | meter_envoy_ai_gw_provider_latency_avg | Latency by provider |
+| Monitoring Panel     | Unit       | Metric Name                            | Description            |
+|----------------------|------------|----------------------------------------|------------------------|
+| Provider Request CPM | calls/min  | meter_envoy_ai_gw_provider_request_cpm | Requests by provider   |
+| Provider Token Rate  | tokens/min | meter_envoy_ai_gw_provider_token_rate  | Token rate by provider |
+| Provider Latency Avg | ms         | meter_envoy_ai_gw_provider_latency_avg | Latency by provider    |
 
 #### Model Breakdown Metrics
 
-| Monitoring Panel | Unit | Metric Name | Description |
-|---|---|---|---|
-| Model Request CPM | calls/min | meter_envoy_ai_gw_model_request_cpm | Requests by model |
-| Model Token Rate | tokens/min | meter_envoy_ai_gw_model_token_rate | Token rate by model |
-| Model Latency Avg | ms | meter_envoy_ai_gw_model_latency_avg | Latency by model |
-| Model TTFT Avg | ms | meter_envoy_ai_gw_model_ttft_avg | TTFT by model |
-| Model TPOT Avg | ms | meter_envoy_ai_gw_model_tpot_avg | TPOT by model |
+| Monitoring Panel  | Unit       | Metric Name                         | Description         |
+|-------------------|------------|-------------------------------------|---------------------|
+| Model Request CPM | calls/min  | meter_envoy_ai_gw_model_request_cpm | Requests by model   |
+| Model Token Rate  | tokens/min | meter_envoy_ai_gw_model_token_rate  | Token rate by model |
+| Model Latency Avg | ms         | meter_envoy_ai_gw_model_latency_avg | Latency by model    |
+| Model TTFT Avg    | ms         | meter_envoy_ai_gw_model_ttft_avg    | TTFT by model       |
+| Model TPOT Avg    | ms         | meter_envoy_ai_gw_model_tpot_avg    | TPOT by model       |
 
 #### Instance Metrics
 
 All service-level metrics are also available per instance (pod) with `meter_envoy_ai_gw_instance_` prefix,
 including per-provider and per-model breakdowns.
 
+### MCP Metrics
+
+When the AI Gateway is configured with MCP (Model Context Protocol) routes, SkyWalking collects
+MCP-specific metrics. These appear in the **MCP** tab on the service and instance dashboards.
+
+#### MCP Service Metrics
+
+| Monitoring Panel                      | Unit      | Metric Name                                             | Description                                                       |
+|---------------------------------------|-----------|---------------------------------------------------------|-------------------------------------------------------------------|
+| MCP Request CPM                       | calls/min | meter_envoy_ai_gw_mcp_request_cpm                       | MCP requests per minute                                           |
+| MCP Request Latency Avg               | ms        | meter_envoy_ai_gw_mcp_request_latency_avg               | Average MCP request duration                                      |
+| MCP Request Latency Percentile        | ms        | meter_envoy_ai_gw_mcp_request_latency_percentile        | P50/P75/P90/P95/P99                                               |
+| MCP Method CPM                        | calls/min | meter_envoy_ai_gw_mcp_method_cpm                        | Requests by MCP method (initialize, tools/list, tools/call, etc.) |
+| MCP Error CPM                         | calls/min | meter_envoy_ai_gw_mcp_error_cpm                         | MCP error requests per minute                                     |
+| MCP Initialization Latency Avg        | ms        | meter_envoy_ai_gw_mcp_initialization_latency_avg        | Average MCP session initialization time                           |
+| MCP Initialization Latency Percentile | ms        | meter_envoy_ai_gw_mcp_initialization_latency_percentile | P50/P75/P90/P95/P99                                               |
+| MCP Capabilities CPM                  | calls/min | meter_envoy_ai_gw_mcp_capabilities_cpm                  | Capabilities negotiated by type                                   |
+
+#### MCP Backend Breakdown Metrics
+
+| Monitoring Panel         | Unit      | Metric Name                                              | Description                    |
+|--------------------------|-----------|----------------------------------------------------------|--------------------------------|
+| Backend Request CPM      | calls/min | meter_envoy_ai_gw_mcp_backend_request_cpm                | Requests by MCP backend        |
+| Backend Latency Avg      | ms        | meter_envoy_ai_gw_mcp_backend_request_latency_avg        | Latency by MCP backend         |
+| Backend Method CPM       | calls/min | meter_envoy_ai_gw_mcp_backend_method_cpm                 | Requests by backend and method |
+| Backend Error CPM        | calls/min | meter_envoy_ai_gw_mcp_backend_error_cpm                  | Errors by MCP backend          |
+| Backend Init Latency Avg | ms        | meter_envoy_ai_gw_mcp_backend_initialization_latency_avg | Init latency by backend        |
+
+#### MCP Instance Metrics
+
+All MCP service-level metrics are also available per instance with `meter_envoy_ai_gw_mcp_instance_` prefix.
+
 ### Access Log Sampling
 
-The LAL rules apply a sampling policy to reduce storage:
+Access logs are tagged with `ai_route_type` (`llm` or `mcp`) for filtering in the log query UI.
+The `ai_route_type` tag is searchable by default.
+
+**LLM route logs:**
 - **Error responses** (HTTP status >= 400) — always persisted.
 - **Upstream failures** — always persisted.
 - **High token cost** (>= 10,000 total tokens) — persisted for cost anomaly detection.
 - Normal successful responses with low token counts are dropped.
 
-The token threshold can be adjusted in `lal/envoy-ai-gateway.yaml`.
+**MCP route logs:**
+- **Error responses** (HTTP status >= 400) — always persisted.
+- Normal MCP requests are dropped (MCP observability is covered by metrics).
+
+The sampling policy can be adjusted in `lal/envoy-ai-gateway.yaml`.
```
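To make the environment-variable table above concrete, a container `env` block for an AI Gateway deployment might look like the following sketch. This is an illustrative Kubernetes fragment, not part of the commit; the service name `my-ai-gateway` and the OAP address `skywalking-oap:11800` are placeholder values to adapt to your cluster.

```yaml
# Hypothetical container spec fragment — names and addresses are examples.
env:
  - name: OTEL_SERVICE_NAME
    value: "my-ai-gateway"               # becomes the SkyWalking service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://skywalking-oap:11800" # SkyWalking OAP gRPC receiver, no Collector needed
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_METRICS_EXPORTER
    value: "otlp"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_RESOURCE_ATTRIBUTES
    # job_name is the fixed routing tag; append the instance and layer
    # attributes required by the monitoring doc.
    value: "job_name=envoy-ai-gateway"
```

Since routing relies on the fixed `job_name=envoy-ai-gateway` attribute, multiple gateway deployments can share the same MAL/LAL rules while remaining distinct SkyWalking services via `OTEL_SERVICE_NAME`.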

oap-server/server-starter/src/main/resources/application.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -118,7 +118,7 @@ core:
     searchableTracesTags: ${SW_SEARCHABLE_TAG_KEYS:http.method,http.status_code,rpc.status_code,db.type,db.instance,mq.queue,mq.topic,mq.broker}
     # Define the set of log tag keys, which should be searchable through the GraphQL.
     # The max length of key=value should be less than 256 or will be dropped.
-    searchableLogsTags: ${SW_SEARCHABLE_LOGS_TAG_KEYS:level,http.status_code}
+    searchableLogsTags: ${SW_SEARCHABLE_LOGS_TAG_KEYS:level,http.status_code,ai_route_type}
     # Define the set of alarm tag keys, which should be searchable through the GraphQL.
     # The max length of key=value should be less than 256 or will be dropped.
     searchableAlarmTags: ${SW_SEARCHABLE_ALARM_TAG_KEYS:level}
```
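Because the default is wrapped in `${SW_SEARCHABLE_LOGS_TAG_KEYS:...}`, the tag list can also be overridden at deploy time without editing `application.yml`. A possible docker-compose-style override (illustrative only, not part of the commit — note the override replaces the default list, so `ai_route_type` must be repeated to stay searchable):

```yaml
# Hypothetical deployment override for the OAP container.
environment:
  SW_SEARCHABLE_LOGS_TAG_KEYS: "level,http.status_code,ai_route_type"
```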

oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml

Lines changed: 57 additions & 4 deletions
```diff
@@ -15,29 +15,52 @@
 
 # Envoy AI Gateway access log processing via OTLP.
 #
-# Sampling policy: only persist abnormal or expensive requests.
-# Normal 200 responses with low token count and no upstream failure are dropped.
+# Two rules: one for LLM route logs, one for MCP route logs.
+# LLM sampling: persist error responses (>= 400), upstream failures, or high-token requests (>= 10000).
+# MCP sampling: only persist error responses (>= 400).
+# Both tag ai_route_type for searchable filtering in the UI.
 
 rules:
-  - name: envoy-ai-gateway-access-log
+  - name: envoy-ai-gateway-llm-access-log
     layer: ENVOY_AI_GATEWAY
     dsl: |
       filter {
-        // Drop normal logs: response < 400, no upstream failure, low token count
+        // Only process LLM route logs (gen_ai.request.model is always set for LLM routes, even on errors)
+        if (tag("gen_ai.request.model") == "" || tag("gen_ai.request.model") == "-") {
+          abort {}
+        }
+
+        // Keep: error responses (>= 400), upstream failures, or high-token requests (>= 10000 total tokens)
+        // Abort logs without response_code unless there is an upstream failure
+        if (tag("response_code") == "" || tag("response_code") == "-") {
+          if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
+            abort {}
+          }
+        }
+        // For normal responses (< 400), check upstream failure and token cost
         if (tag("response_code") != "" && tag("response_code") != "-") {
           if (tag("response_code") as Integer < 400) {
+            if (tag("upstream_transport_failure_reason") != "" && tag("upstream_transport_failure_reason") != "-") {
+              // upstream failure — keep
+            }
             if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
+              // no upstream failure — check token cost
               if (tag("gen_ai.usage.input_tokens") != "" && tag("gen_ai.usage.input_tokens") != "-"
                   && tag("gen_ai.usage.output_tokens") != "" && tag("gen_ai.usage.output_tokens") != "-") {
                 if ((tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {
                   abort {}
                 }
               }
+              if (tag("gen_ai.usage.input_tokens") == "" || tag("gen_ai.usage.input_tokens") == "-"
+                  || tag("gen_ai.usage.output_tokens") == "" || tag("gen_ai.usage.output_tokens") == "-") {
+                abort {}
+              }
             }
           }
         }
 
         extractor {
+          tag 'ai_route_type': "llm"
           tag 'gen_ai.request.model': tag("gen_ai.request.model")
           tag 'gen_ai.response.model': tag("gen_ai.response.model")
           tag 'gen_ai.provider.name': tag("gen_ai.provider.name")
@@ -50,3 +73,33 @@ rules:
         sink {
         }
       }
+
+  - name: envoy-ai-gateway-mcp-access-log
+    layer: ENVOY_AI_GATEWAY
+    dsl: |
+      filter {
+        // Only process MCP route logs
+        if (tag("mcp.method.name") == "" || tag("mcp.method.name") == "-") {
+          abort {}
+        }
+
+        // Only persist error responses (>= 400)
+        if (tag("response_code") == "" || tag("response_code") == "-") {
+          abort {}
+        }
+        if (tag("response_code") as Integer < 400) {
+          abort {}
+        }
+
+        extractor {
+          tag 'ai_route_type': "mcp"
+          tag 'mcp.method.name': tag("mcp.method.name")
+          tag 'mcp.provider.name': tag("mcp.provider.name")
+          tag 'mcp.session.id': tag("mcp.session.id")
+          tag 'response_code': tag("response_code")
+          tag 'duration': tag("duration")
+        }
+
+        sink {
+        }
+      }
```
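The nested branching in the LLM filter above is easier to audit as a plain decision function. The sketch below is a Python re-statement of the same keep/drop policy — an illustration of the rule's logic, not code from the commit. `UNSET` stands for the `""`/`"-"` tag values that Envoy emits for absent fields.

```python
# Python mirror of the LAL sampling decisions (illustrative, assumes string tag values).
UNSET = ("", "-")

def keep_llm_log(response_code, upstream_failure, input_tokens, output_tokens):
    """Mirror of the envoy-ai-gateway-llm-access-log sampling policy."""
    if upstream_failure not in UNSET:
        return True   # upstream failures are always persisted
    if response_code in UNSET:
        return False  # no status code and no upstream failure: drop
    if int(response_code) >= 400:
        return True   # error responses are always persisted
    if input_tokens in UNSET or output_tokens in UNSET:
        return False  # successful but token counts unknown: drop
    # successful response: persist only high-token-cost requests
    return int(input_tokens) + int(output_tokens) >= 10000

def keep_mcp_log(response_code):
    """Mirror of the envoy-ai-gateway-mcp-access-log sampling policy."""
    return response_code not in UNSET and int(response_code) >= 400
```

Restated this way, it is clear that a normal 200 LLM response survives only when its combined token count reaches the 10,000 threshold, while MCP logs are persisted solely on error status.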
