Commit c22d878

Document o11y for vMCP
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
1 parent 51bd506 commit c22d878

5 files changed, +137 -0 lines changed

cmd/vmcp/README.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@ The Virtual MCP Server (vmcp) is a standalone binary that aggregates multiple MC
- **Session Management**: MCP protocol session tracking with TTL-based cleanup
- **Health Endpoints**: `/health` and `/ping` for service monitoring
- **Configuration Validation**: `vmcp validate` command for config verification
- **Observability**: OpenTelemetry metrics and traces for backend operations and workflow executions

### In Progress
- 🚧 **Incoming Authentication** (Issue #165): OIDC, local, anonymous authentication
@@ -121,6 +122,7 @@ vmcp uses a YAML configuration file to define:
3. **Outgoing Authentication**: Virtual MCP → Backend API token exchange
4. **Tool Aggregation**: Conflict resolution and filtering strategies
5. **Operational Settings**: Timeouts, health checks, circuit breakers
6. **Telemetry**: OpenTelemetry metrics/tracing and Prometheus endpoint

See [examples/vmcp-config.yaml](../../examples/vmcp-config.yaml) for a complete example.

docs/observability.md

Lines changed: 7 additions & 0 deletions
@@ -84,3 +84,10 @@ The telemetry middleware:

This provides end-to-end visibility across the entire request lifecycle while
maintaining the modular architecture of ToolHive's middleware system.

## Virtual MCP Server Telemetry

For observability in the Virtual MCP Server (vMCP), including backend request
metrics, workflow execution telemetry, and distributed tracing, see the
dedicated [Virtual MCP Server Observability](./operator/virtualmcpserver-observability.md)
documentation.

docs/operator/virtualmcpserver-api.md

Lines changed: 43 additions & 0 deletions
@@ -277,6 +277,40 @@ spec:
cpu: "1000m"
```

### `.spec.telemetry` (optional)

Configures OpenTelemetry-based observability for the Virtual MCP server, including distributed tracing, OTLP metrics export, and a Prometheus metrics endpoint.

**Type**: `telemetry.Config`

**Fields**:
- `endpoint` (string): OTLP collector endpoint (host:port format)
- `serviceName` (string): Service name for telemetry
- `serviceVersion` (string): Service version for telemetry
- `tracingEnabled` (boolean): Enable distributed tracing
- `metricsEnabled` (boolean): Enable OTLP metrics export
- `samplingRate` (float64): Trace sampling rate (0.0-1.0)
- `headers` (map[string]string): Authentication headers for the OTLP endpoint
- `insecure` (boolean): Use HTTP instead of HTTPS
- `enablePrometheusMetricsPath` (boolean): Expose a Prometheus `/metrics` endpoint
- `environmentVariables` ([]string): Environment variable names to include as span attributes
- `customAttributes` (map[string]string): Custom resource attributes for all telemetry signals

**Example**:
```yaml
spec:
  telemetry:
    endpoint: "otel-collector:4317"
    serviceName: "my-vmcp"
    tracingEnabled: true
    metricsEnabled: true
    samplingRate: 0.1
    insecure: true
    enablePrometheusMetricsPath: true
```

For details on what metrics and traces are emitted, see the [Virtual MCP Server Observability](./virtualmcpserver-observability.md) documentation.
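
To make the field list concrete, here is a minimal Go sketch that decodes the JSON form of `spec.telemetry` and checks the two constraints stated above: `samplingRate` within 0.0-1.0 and `endpoint` in host:port form. The `TelemetryConfig` struct, its JSON tags, and the `ParseTelemetry` helper are hypothetical illustrations that mirror the documented field names; they are not ToolHive's actual `telemetry.Config` type or the operator's validation logic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// TelemetryConfig is a hypothetical Go mirror of the documented
// `.spec.telemetry` fields. It is NOT ToolHive's real telemetry.Config type.
type TelemetryConfig struct {
	Endpoint                    string            `json:"endpoint,omitempty"`
	ServiceName                 string            `json:"serviceName,omitempty"`
	ServiceVersion              string            `json:"serviceVersion,omitempty"`
	TracingEnabled              bool              `json:"tracingEnabled,omitempty"`
	MetricsEnabled              bool              `json:"metricsEnabled,omitempty"`
	SamplingRate                float64           `json:"samplingRate,omitempty"`
	Headers                     map[string]string `json:"headers,omitempty"`
	Insecure                    bool              `json:"insecure,omitempty"`
	EnablePrometheusMetricsPath bool              `json:"enablePrometheusMetricsPath,omitempty"`
	EnvironmentVariables        []string          `json:"environmentVariables,omitempty"`
	CustomAttributes            map[string]string `json:"customAttributes,omitempty"`
}

// Validate checks only the two constraints stated in the field list:
// samplingRate must lie in [0.0, 1.0], and endpoint (if set) must be host:port.
func (c TelemetryConfig) Validate() error {
	if c.SamplingRate < 0 || c.SamplingRate > 1 {
		return fmt.Errorf("samplingRate %v is outside [0.0, 1.0]", c.SamplingRate)
	}
	if c.Endpoint != "" {
		if _, _, err := net.SplitHostPort(c.Endpoint); err != nil {
			return fmt.Errorf("endpoint %q is not host:port: %w", c.Endpoint, err)
		}
	}
	return nil
}

// ParseTelemetry decodes the JSON form of spec.telemetry and validates it.
func ParseTelemetry(raw []byte) (TelemetryConfig, error) {
	var cfg TelemetryConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return TelemetryConfig{}, err
	}
	return cfg, cfg.Validate()
}

func main() {
	// JSON equivalent of the YAML example above.
	raw := []byte(`{
		"endpoint": "otel-collector:4317",
		"serviceName": "my-vmcp",
		"tracingEnabled": true,
		"metricsEnabled": true,
		"samplingRate": 0.1,
		"insecure": true,
		"enablePrometheusMetricsPath": true
	}`)
	cfg, err := ParseTelemetry(raw)
	fmt.Println(cfg.ServiceName, cfg.SamplingRate, err)
}
```

An endpoint such as `otel-collector` (no port) or a value like `samplingRate: 1.5` is rejected by this sketch; whether the real CRD rejects them at admission time or later is not covered here.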

## Status Fields

### `.status.conditions`
@@ -437,6 +471,14 @@ spec:
failureThreshold: 5
timeout: 60s

  # Observability
  telemetry:
    endpoint: "otel-collector:4317"
    tracingEnabled: true
    metricsEnabled: true
    samplingRate: 0.1
    enablePrometheusMetricsPath: true

status:
  phase: Ready
  message: "Virtual MCP serving 3 backends with 15 tools"
@@ -502,4 +544,5 @@ The VirtualMCPServer CRD includes comprehensive validation:
- [MCPServer](./mcpserver-api.md): Individual MCP server instances
- [MCPExternalAuthConfig](./mcpexternalauthconfig-api.md): External authentication configuration
- [MCPToolConfig](./toolconfig-api.md): Tool filtering and renaming configuration
- [Virtual MCP Server Observability](./virtualmcpserver-observability.md): Telemetry and metrics documentation
- [Virtual MCP Proposal](../proposals/THV-2106-virtual-mcp-server.md): Complete design proposal

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Virtual MCP Server Observability

This document describes the observability features of the Virtual MCP
Server (vMCP), which aggregates multiple backend MCP servers into a unified
interface. The vMCP provides OpenTelemetry-based instrumentation for monitoring
backend operations and composite tool workflow executions.

For general ToolHive observability concepts and proxy runner telemetry, see the
main [Observability and Telemetry](../observability.md) documentation.

## Overview

The vMCP telemetry provides visibility into:

1. **Backend operations**: Track requests to individual backend MCP servers,
   including tool calls, resource reads, prompt retrieval, and capability listing
2. **Workflow executions**: Monitor composite tool workflow performance and errors
3. **Distributed tracing**: Correlate requests across the vMCP and its backends

The vMCP uses a decorator pattern to wrap backend clients and workflow executors
with telemetry instrumentation. This approach provides consistent metrics and
tracing without modifying the core business logic.

Both metrics and traces are implemented in `pkg/vmcp/server/telemetry.go`.

## Metrics

The vMCP emits metrics for backend operations and workflow executions. All
metrics use the `toolhive_vmcp_` prefix.

**Backend metrics** track requests to individual backend MCP servers, including
request counts, error counts, and request duration histograms. These metrics
include attributes identifying the target backend (workload ID, name, URL,
transport type) and the action being performed (tool call, resource read, etc.).
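
Each distinct combination of attribute values produces its own time series, which is what lets backend metrics be sliced per workload or per action. The sketch below models that with a tiny in-memory counter. The attribute keys (`workload_id`, `transport_type`, and so on), the example URLs, and the metric name suffix are hypothetical; only the `toolhive_vmcp_` prefix and the set of dimensions come from this document.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Attrs is a set of metric attributes (Prometheus calls them labels).
type Attrs map[string]string

// key builds a canonical string for an attribute set, so the same
// combination always maps to the same series regardless of map order.
func (a Attrs) key() string {
	keys := make([]string, 0, len(a))
	for k := range a {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, len(keys))
	for i, k := range keys {
		parts[i] = k + "=" + a[k]
	}
	return strings.Join(parts, ",")
}

// Counter keeps one running total per distinct attribute combination.
type Counter struct {
	Name   string
	series map[string]int64
}

func NewCounter(name string) *Counter {
	return &Counter{Name: name, series: map[string]int64{}}
}

func (c *Counter) Add(n int64, a Attrs) { c.series[a.key()] += n }

func (c *Counter) Value(a Attrs) int64 { return c.series[a.key()] }

func (c *Counter) SeriesCount() int { return len(c.series) }

// backendAttrs uses hypothetical keys for the documented dimensions:
// workload ID, name, URL, transport type, and action.
func backendAttrs(id, name, url, transport, action string) Attrs {
	return Attrs{
		"workload_id": id, "workload_name": name, "workload_url": url,
		"transport_type": transport, "action": action,
	}
}

func main() {
	// Hypothetical metric name; only the prefix is documented.
	requests := NewCounter("toolhive_vmcp_backend_requests")
	gh := backendAttrs("wl-1", "github", "http://github-mcp:8080", "streamable-http", "call_tool")
	requests.Add(1, gh)
	requests.Add(1, gh)
	requests.Add(1, backendAttrs("wl-1", "github", "http://github-mcp:8080", "streamable-http", "read_resource"))
	fmt.Println(requests.SeriesCount(), requests.Value(gh))
}
```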

**Workflow metrics** track composite tool workflow executions, including
execution counts, error counts, and duration histograms. These metrics include
the workflow name as an attribute.

## Distributed Tracing

The vMCP creates a span for each individual backend operation as well as for each
workflow execution, so workflow execution errors or latency can be attributed to
specific tool calls.

## Configuration

Configure telemetry in the `VirtualMCPServer` resource using the `spec.telemetry`
field:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
spec:
  groupRef:
    name: my-group
  telemetry:
    endpoint: "otel-collector:4317"
    serviceName: "my-vmcp"
    tracingEnabled: true
    metricsEnabled: true
    samplingRate: 0.1
    insecure: true
    enablePrometheusMetricsPath: true
```

See the [VirtualMCPServer API reference](./virtualmcpserver-api.md) for complete
CRD documentation.
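
The `samplingRate: 0.1` setting above keeps roughly 10% of traces. This document does not say which sampler the vMCP uses; assuming it maps the rate to a trace-ID ratio sampler like OpenTelemetry's `TraceIDRatioBased`, the decision is a deterministic function of the trace ID, so services using the same rate make the same keep/drop choice for a given trace. A minimal Go sketch of that comparison follows; `RatioSampler`, `syntheticTraceID`, and `SampledFraction` are hypothetical names, and the synthetic IDs exist only for the demonstration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// TraceID is a 16-byte W3C trace ID.
type TraceID [16]byte

// RatioSampler compares the low 8 bytes of the trace ID (shifted to 63 bits)
// against rate * 2^63, in the style of a trace-ID ratio sampler.
type RatioSampler struct{ upperBound uint64 }

func NewRatioSampler(rate float64) RatioSampler {
	if rate >= 1 {
		return RatioSampler{upperBound: 1 << 63} // every 63-bit value is below this
	}
	if rate <= 0 {
		return RatioSampler{upperBound: 0} // nothing is below zero
	}
	return RatioSampler{upperBound: uint64(rate * (1 << 63))}
}

// ShouldSample depends only on the trace ID, so the decision is repeatable.
func (s RatioSampler) ShouldSample(id TraceID) bool {
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	return x < s.upperBound
}

// syntheticTraceID builds a deterministic, well-spread ID for this demo only;
// real trace IDs come from a random generator.
func syntheticTraceID(i uint64) TraceID {
	var id TraceID
	binary.BigEndian.PutUint64(id[:8], i)
	binary.BigEndian.PutUint64(id[8:], i*0x9E3779B97F4A7C15)
	return id
}

// SampledFraction reports the share of n synthetic trace IDs that are kept.
func SampledFraction(s RatioSampler, n uint64) float64 {
	kept := 0
	for i := uint64(0); i < n; i++ {
		if s.ShouldSample(syntheticTraceID(i)) {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	s := NewRatioSampler(0.1) // samplingRate: 0.1
	fmt.Printf("kept %.3f of traces\n", SampledFraction(s, 100000))
}
```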

## Related Documentation

- [Observability and Telemetry](../observability.md) - Main ToolHive observability documentation
- [VirtualMCPServer API Reference](./virtualmcpserver-api.md) - Complete CRD specification

examples/vmcp-config.yaml

Lines changed: 11 additions & 0 deletions
@@ -187,3 +187,14 @@ operational:
# environment: "{{.steps.confirm_deploy.content.environment}}"
# depends_on: ["confirm_deploy"]
# condition: "{{.steps.confirm_deploy.action == 'accept'}}"

# ===== OBSERVABILITY =====
# OpenTelemetry-based metrics and tracing for backend operations and workflows
telemetry:
  endpoint: "localhost:4317" # OTLP collector endpoint
  serviceName: "engineering-vmcp"
  tracingEnabled: true
  metricsEnabled: true
  samplingRate: 0.1 # 10% sampling
  insecure: true # Use HTTP instead of HTTPS
  enablePrometheusMetricsPath: true # Expose /metrics endpoint
