[BUG][OBSERVABILITY]: Metrics not scoped to server_id — showing aggregated totals #3642

@rakdutta

Reference issue #3598

Bug Summary

When querying /servers/{server_id}/tools?include_metrics=true, the API returns incorrect metrics that are aggregated across ALL servers using the tool, instead of metrics specific to the requested server.

Steps to Reproduce

  1. Create a tool (e.g., weather-tool)
  2. Associate the tool with two virtual servers: server1 and server2
  3. Execute the tool:
    • 2 times via server1
    • 3 times via server2
    • 1 time via Admin UI (no server context)
  4. Query metrics for server1:
    curl -X GET "http://localhost:4444/servers/server1/tools?include_metrics=true"

Expected Behavior

The API should return metrics scoped to server1 only:

{
  "name": "weather-tool",
  "metrics": {
    "total_executions": 2,
    "successful_executions": 2,
    "failed_executions": 0
  }
}

Actual Behavior

The API returns metrics aggregated across all servers (2 via server1 + 3 via server2 + 1 via Admin UI = 6):

{
  "name": "weather-tool",
  "metrics": {
    "total_executions": 6,
    "successful_executions": 6,
    "failed_executions": 0
  }
}

Impact

  • Severity: High
  • Affected Endpoints:
    • GET /servers/{server_id}/tools?include_metrics=true
    • GET /servers/{server_id}/resources?include_metrics=true
    • GET /servers/{server_id}/prompts?include_metrics=true
  • User Impact:
    • Cannot track per-server SLAs or performance
    • Multi-tenant deployments cannot isolate server-specific issues
    • Misleading data for capacity planning and troubleshooting

Root Cause

The metric tables are missing a server_id column:

  • tool_metrics table only has tool_id, no server_id
  • resource_metrics table only has resource_id, no server_id
  • prompt_metrics table only has prompt_id, no server_id
  • Hourly rollup tables (*_metrics_hourly) have the same issue

When metrics are recorded during tool/resource/prompt execution, the system does not capture which server was used. The metrics_summary property in mcpgateway/db.py aggregates ALL metrics for the entity without filtering by server.

Environment

  • Version: v1.0.0-RC2

Proposed Fix

1. Database Schema Changes (Migration Required)

Add server_id column to all metric tables:

-- Raw metrics tables
ALTER TABLE tool_metrics ADD COLUMN server_id VARCHAR(36) NULL;
ALTER TABLE resource_metrics ADD COLUMN server_id VARCHAR(36) NULL;
ALTER TABLE prompt_metrics ADD COLUMN server_id VARCHAR(36) NULL;

-- Hourly rollup tables
ALTER TABLE tool_metrics_hourly ADD COLUMN server_id VARCHAR(36) NULL;
ALTER TABLE resource_metrics_hourly ADD COLUMN server_id VARCHAR(36) NULL;
ALTER TABLE prompt_metrics_hourly ADD COLUMN server_id VARCHAR(36) NULL;

-- Add indexes for query performance
CREATE INDEX idx_tool_metrics_server_id ON tool_metrics(server_id);
CREATE INDEX idx_resource_metrics_server_id ON resource_metrics(server_id);
CREATE INDEX idx_prompt_metrics_server_id ON prompt_metrics(server_id);

Note: server_id is NULLABLE to support:

  • Legacy metrics (existing data before fix)
  • Admin UI executions (no server context)
  • Direct invocations outside server context
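
The effect of adding a nullable column can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the pared-down tool_metrics table and its is_success column are hypothetical stand-ins, and the real migration would go through Alembic against the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, pared-down pre-migration table: only tool_id identifies a row.
conn.execute(
    "CREATE TABLE tool_metrics ("
    " id INTEGER PRIMARY KEY,"
    " tool_id VARCHAR(36) NOT NULL,"
    " is_success BOOLEAN NOT NULL)"
)
conn.executemany(
    "INSERT INTO tool_metrics (tool_id, is_success) VALUES (?, ?)",
    [("weather-tool", 1), ("weather-tool", 1)],
)

# The migration step: a column added without NOT NULL is nullable,
# so existing rows are kept and simply get server_id = NULL.
conn.execute("ALTER TABLE tool_metrics ADD COLUMN server_id VARCHAR(36)")
conn.execute("CREATE INDEX idx_tool_metrics_server_id ON tool_metrics(server_id)")

# Rows written after the migration can carry a server.
conn.execute(
    "INSERT INTO tool_metrics (tool_id, is_success, server_id) VALUES (?, ?, ?)",
    ("weather-tool", 1, "server1"),
)

legacy_rows = conn.execute(
    "SELECT COUNT(*) FROM tool_metrics WHERE server_id IS NULL"
).fetchone()[0]
scoped_rows = conn.execute(
    "SELECT COUNT(*) FROM tool_metrics WHERE server_id = ?", ("server1",)
).fetchone()[0]

print(legacy_rows, scoped_rows)  # 2 1
```

The two legacy rows survive the migration untouched, which is the behavior the nullable column is meant to guarantee.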

2. Code Changes Required

A. Update ORM Models (mcpgateway/db.py)

class ToolMetric(Base):
    # ... existing fields ...
    server_id: Mapped[Optional[str]] = mapped_column(String(36), nullable=True, index=True)

class ResourceMetric(Base):
    # ... existing fields ...
    server_id: Mapped[Optional[str]] = mapped_column(String(36), nullable=True, index=True)

class PromptMetric(Base):
    # ... existing fields ...
    server_id: Mapped[Optional[str]] = mapped_column(String(36), nullable=True, index=True)

B. Update Metric Recording

Capture server_id in all execution paths:

  • JSON-RPC handler (/rpc)
  • SSE transport (/sse)
  • WebSocket transport
  • Streamable HTTP transport (/mcp)
  • Direct tool execution endpoints
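
A minimal sketch of what capturing server_id at recording time could look like. MetricRecord and record_execution are hypothetical names used for illustration only; they are not the existing record_tool_metric method, whose signature is not shown in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricRecord:
    """Hypothetical stand-in for a row in tool_metrics."""
    tool_id: str
    is_success: bool
    response_time: float
    server_id: Optional[str] = None  # None = no server context


# In-memory stand-in for the metrics table.
recorded: List[MetricRecord] = []


def record_execution(
    tool_id: str,
    is_success: bool,
    response_time: float,
    server_id: Optional[str] = None,
) -> MetricRecord:
    """Store one execution together with the server it was routed through."""
    record = MetricRecord(tool_id, is_success, response_time, server_id)
    recorded.append(record)
    return record


# A server-scoped transport knows its server_id and forwards it.
record_execution("weather-tool", True, 0.12, server_id="server1")

# An Admin UI execution has no server context, so server_id stays None.
record_execution("weather-tool", True, 0.08)
```

Keeping server_id optional with a default of None means call sites that have no server context do not need to change.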

C. Update Metrics Aggregation

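Modify metrics_summary to filter by server_id. Because a Python property cannot accept arguments, this means turning the property into a method (or adding a method alongside it):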

# In Tool/Resource/Prompt models (mcpgateway/db.py)
def metrics_summary(self, server_id: Optional[str] = None) -> Dict[str, Any]:
    """Aggregated metrics, optionally filtered by server_id."""
    # Add WHERE server_id = ? to SQL queries
    # Filter in-memory metrics by server_id
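
The in-memory half of that filter might look like the following sketch. MetricRow and ToolSketch are hypothetical stand-ins, not the real ORM classes, and the summary is reduced to the three counters shown in the expected response above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MetricRow:
    """Hypothetical stand-in for one recorded execution."""
    is_success: bool
    server_id: Optional[str] = None


@dataclass
class ToolSketch:
    """Hypothetical stand-in for the Tool model with its loaded metrics."""
    name: str
    metrics: List[MetricRow] = field(default_factory=list)

    def metrics_summary(self, server_id: Optional[str] = None) -> Dict[str, Any]:
        """Aggregate metrics, optionally restricted to one server."""
        rows = self.metrics
        if server_id is not None:
            # Rows with server_id = None never match a concrete server.
            rows = [m for m in rows if m.server_id == server_id]
        total = len(rows)
        successful = sum(1 for m in rows if m.is_success)
        return {
            "total_executions": total,
            "successful_executions": successful,
            "failed_executions": total - successful,
        }


# The scenario from the reproduction steps: 2 + 3 + 1 executions.
tool = ToolSketch("weather-tool")
tool.metrics += [MetricRow(True, "server1") for _ in range(2)]
tool.metrics += [MetricRow(True, "server2") for _ in range(3)]
tool.metrics.append(MetricRow(True, None))  # Admin UI, no server context

print(tool.metrics_summary("server1")["total_executions"])  # 2
print(tool.metrics_summary()["total_executions"])           # 6
```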

D. Update Service Methods

Pass server_id to conversion methods:

# In mcpgateway/services/tool_service.py - list_server_tools()
for tool in tools:
    result.append(
        self.convert_tool_to_read(
            tool,
            include_metrics=include_metrics,
            server_id=server_id,  # NEW: Pass server context for filtering
            # ... other params ...
        )
    )

3. Backward Compatibility

  • Existing metrics will have server_id = NULL
  • Queries without server_id filter will aggregate all metrics (preserves current behavior for non-server contexts)
  • Queries with server_id filter will only include matching metrics (fixes the bug)
  • Historical data remains queryable but not server-scoped
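
These rules can be illustrated on the SQL side with a small sqlite3 sketch. The table layout and the count_executions helper are assumptions made for the example, not existing project code.

```python
import sqlite3
from typing import List, Optional

conn = sqlite3.connect(":memory:")

# Hypothetical post-migration table with the nullable server_id column.
conn.execute(
    "CREATE TABLE tool_metrics ("
    " tool_id TEXT NOT NULL,"
    " is_success INTEGER NOT NULL,"
    " server_id TEXT)"
)
rows = (
    [("weather-tool", 1, "server1")] * 2
    + [("weather-tool", 1, "server2")] * 3
    + [("weather-tool", 1, None)]  # legacy or Admin UI row
)
conn.executemany("INSERT INTO tool_metrics VALUES (?, ?, ?)", rows)


def count_executions(tool_id: str, server_id: Optional[str] = None) -> int:
    """Count executions, adding the server filter only when one is given."""
    query = "SELECT COUNT(*) FROM tool_metrics WHERE tool_id = ?"
    params: List[Optional[str]] = [tool_id]
    if server_id is not None:
        # NULL never equals a value in SQL, so legacy rows are excluded here.
        query += " AND server_id = ?"
        params.append(server_id)
    return conn.execute(query, params).fetchone()[0]


print(count_executions("weather-tool"))             # 6
print(count_executions("weather-tool", "server1"))  # 2
```

The unfiltered call keeps the current aggregate behavior, while the filtered call leaves out both other servers and rows with no server context.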

Files to Modify

  1. Database Migration: mcpgateway/alembic/versions/XXXX_add_server_id_to_metrics.py
  2. ORM Models: mcpgateway/db.py (ToolMetric, ResourceMetric, PromptMetric, *MetricsHourly classes)
  3. Metric Recording:
    • mcpgateway/services/tool_service.py (record_tool_metric method)
    • mcpgateway/services/resource_service.py (record_resource_metric method)
    • mcpgateway/services/prompt_service.py (record_prompt_metric method)
  4. Metric Aggregation: mcpgateway/db.py (metrics_summary properties in Tool/Resource/Prompt classes)
  5. Service Conversions:
    • mcpgateway/services/tool_service.py (convert_tool_to_read, list_server_tools)
    • mcpgateway/services/resource_service.py (convert_resource_to_read, list_server_resources)
    • mcpgateway/services/prompt_service.py (convert_prompt_to_read, list_server_prompts)
  6. Rollup Service: mcpgateway/services/metrics_rollup_service.py (aggregate by server_id)

Testing Checklist

  • Verify metrics are recorded with correct server_id
  • Verify /servers/{server_id}/tools?include_metrics=true returns server-scoped metrics
  • Verify /servers/{server_id}/resources?include_metrics=true returns server-scoped metrics
  • Verify /servers/{server_id}/prompts?include_metrics=true returns server-scoped metrics
  • Verify Admin UI executions (no server) still work with server_id = NULL
  • Verify backward compatibility with existing NULL server_id metrics
  • Verify multi-server scenarios (same tool on 2+ servers)
  • Verify hourly rollup aggregation includes server_id grouping
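
For the last item, the grouping a rollup would need can be sketched as follows. RawMetric and rollup_hourly are hypothetical; the actual rollup service and its hourly table columns are not shown in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class RawMetric:
    """Hypothetical stand-in for one raw metric row."""
    tool_id: str
    server_id: Optional[str]
    timestamp: datetime
    is_success: bool


# One bucket per (tool, server, hour); server may be None.
RollupKey = Tuple[str, Optional[str], datetime]


def rollup_hourly(metrics: List[RawMetric]) -> Dict[RollupKey, Dict[str, int]]:
    """Group raw metrics by tool, server, and the hour they fall into."""
    buckets: Dict[RollupKey, Dict[str, int]] = defaultdict(
        lambda: {"total": 0, "success": 0, "failure": 0}
    )
    for m in metrics:
        hour = m.timestamp.replace(minute=0, second=0, microsecond=0)
        bucket = buckets[(m.tool_id, m.server_id, hour)]
        bucket["total"] += 1
        bucket["success" if m.is_success else "failure"] += 1
    return dict(buckets)


raw = [
    RawMetric("weather-tool", "server1", datetime(2024, 1, 1, 10, 5), True),
    RawMetric("weather-tool", "server1", datetime(2024, 1, 1, 10, 40), False),
    RawMetric("weather-tool", "server2", datetime(2024, 1, 1, 10, 15), True),
    RawMetric("weather-tool", None, datetime(2024, 1, 1, 10, 20), True),
    RawMetric("weather-tool", "server1", datetime(2024, 1, 1, 11, 2), True),
]
hourly = rollup_hourly(raw)
ten = datetime(2024, 1, 1, 10)

print(len(hourly))                               # 4
print(hourly[("weather-tool", "server1", ten)])  # {'total': 2, 'success': 1, 'failure': 1}
```

Including server_id in the grouping key keeps executions without server context in their own bucket rather than merging them into a server's totals.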

Workaround

None available. Metrics are currently incorrect for multi-server deployments.

Related Issues

  • Affects all multi-server deployments
  • Impacts observability and monitoring accuracy
  • Prevents proper SLA tracking per virtual server

Metadata

Labels

  • SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release)
  • bug (Something isn't working)
  • observability (Observability, logging, monitoring)
