
feat(api): add leader instance identification with operational metadata in /health endpoint #3975

Open
bogdanmariusc10 wants to merge 4 commits into main from 3838-featureapi-enhancing-multi-instance-monitoring-with-leader-status-indicators

bogdanmariusc10 (Collaborator) commented Apr 1, 2026

🔗 Related Issue

Closes #3838


📝 Summary

This PR implements leader instance identification and operational metadata mapping for MCP Gateway clusters, enabling operators and monitoring tools to easily identify which specific physical instance currently holds the cluster leadership.

Problem: Previously, the system used random UUIDs in Redis for leader election without mapping to any identifiable process information. In production environments with multiple instances, it was impossible to know which instance was the leader without destructive testing or temporary code patching.

Solution:

  • Store structured JSON metadata (port, PID, hostname, instance_id) in Redis instead of plain UUID
  • Add is_leader boolean field to /health endpoint for automated monitoring
  • Make /health endpoint async to properly check Redis leadership status
  • Maintain backward compatibility with legacy UUID-only format
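
A minimal sketch of the metadata record listed above, not the code from gateway_service.py. It assumes the port arrives as a string (for example from an environment variable) and falls back to 0 when it is missing or invalid, which matches the initialization tests described in this PR. The function name `build_instance_metadata` and the use of `socket.gethostname()` are assumptions made for illustration.

```python
import json
import os
import socket
import uuid
from typing import Any, Dict, Optional


def build_instance_metadata(port: Optional[str]) -> Dict[str, Any]:
    """Collect the four fields stored for the leader: instance_id, port, pid, hostname."""
    try:
        port_number = int(port)  # raises TypeError for None, ValueError for "abc"
    except (TypeError, ValueError):
        port_number = 0  # assumed fallback, mirroring the "defaults to 0" tests
    return {
        "instance_id": str(uuid.uuid4()),
        "port": port_number,
        "pid": os.getpid(),
        "hostname": socket.gethostname(),  # assumption: the real code may resolve this differently
    }


# The JSON string below is what would be written to Redis instead of a bare UUID.
payload = json.dumps(build_instance_metadata("8001"))
```

Serializing to a single JSON string keeps the value readable with one `redis-cli GET` while still carrying the instance_id that the legacy format stored on its own.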

Impact: SREs can now identify the leader with a single Redis command or health check, and monitoring tools (Prometheus/Grafana) can track leadership changes automatically.


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

| Check | Command | Status |
|---|---|---|
| Lint suite | make lint | Pass ✅ |
| Unit tests | make test | Pass ✅ |
| Coverage ≥ 80% | make coverage | Pass ✅ |

Functional Testing:

  • ✅ Tested with 3 instances (ports 8001, 8002, 8003)
  • ✅ Verified /health endpoint returns correct is_leader status
  • ✅ Verified Redis stores JSON metadata with all operational fields
  • ✅ Tested failover: killed leader, confirmed new leader election within 15s
  • ✅ Verified backward compatibility with legacy UUID format
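
The multi-instance check above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `fetch_health` and `find_leaders` are hypothetical, the ports are the ones from the functional test, and only the `is_leader` field of the /health response is inspected.

```python
import json
import urllib.request
from typing import Any, Dict, List, Mapping


def fetch_health(port: int, host: str = "localhost", timeout: float = 2.0) -> Dict[str, Any]:
    """GET /health from one instance and decode the JSON body."""
    with urllib.request.urlopen(f"http://{host}:{port}/health", timeout=timeout) as response:
        return json.load(response)


def find_leaders(health_by_port: Mapping[int, Mapping[str, Any]]) -> List[int]:
    """Return the ports whose health response reports is_leader: true."""
    return sorted(port for port, body in health_by_port.items() if body.get("is_leader") is True)


# Example with canned responses; against a live cluster the dict would be
# built as {port: fetch_health(port) for port in (8001, 8002, 8003)}.
sample = {
    8001: {"status": "healthy", "is_leader": True},
    8002: {"status": "healthy", "is_leader": False},
    8003: {"status": "healthy", "is_leader": False},
}
leaders = find_leaders(sample)
```

A healthy cluster should yield exactly one port; an empty list during failover or more than one entry is the condition worth alerting on.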

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

Implementation Details

Files Modified:

  1. mcpgateway/services/gateway_service.py - Core implementation

    • Initialize _instance_metadata with port, PID, hostname, instance_id
    • Store JSON metadata in Redis during leader election
    • Parse JSON with backward compatibility for legacy UUID format
    • Async is_leader() method for checking leadership status

  2. mcpgateway/main.py - Health endpoint enhancement

    • Made /health endpoint async
    • Added is_leader boolean field to response

  3. tests/unit/mcpgateway/services/test_gateway_service_redis_leadership.py - Test updates

    • Updated to expect JSON metadata format
    • Verified all metadata fields are present

  4. tests/unit/mcpgateway/test_main.py & tests/unit/mcpgateway/test_main_extended.py - Health check test updates

    • Updated health check tests to async
    • Added assertions for is_leader field

Example Usage

Check leader via health endpoint:

curl http://localhost:8001/health
# Returns: {"status":"healthy","is_leader":true,...}

curl http://localhost:8002/health
# Returns: {"status":"healthy","is_leader":false,...}
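
The responses above could be produced by a handler shaped roughly like the sketch below. It is framework-free so it runs on its own; the real endpoint lives in mcpgateway/main.py and returns additional fields. The name `build_health_response` is hypothetical, and the fallback to `is_leader: false` on error follows the behaviour described for test_health_check_leader_exception.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def build_health_response(is_leader: Callable[[], Awaitable[bool]]) -> Dict[str, Any]:
    """Build a /health body, reporting is_leader: false if the leadership check raises."""
    try:
        leader = await is_leader()
    except Exception:
        # The health check should stay responsive even when the Redis lookup fails.
        leader = False
    return {"status": "healthy", "is_leader": leader}


async def _always_leader() -> bool:
    return True


async def _redis_down() -> bool:
    raise RuntimeError("redis unavailable")


leader_body = asyncio.run(build_health_response(_always_leader))
fallback_body = asyncio.run(build_health_response(_redis_down))
```

Passing the leadership check in as a callable keeps the sketch testable without a running Redis.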

Check leader via Redis:

redis-cli GET gateway_service_leader
# Returns:
{
  "instance_id": "86688ed8-2948-418f-a9f6-e18761115260",
  "port": 8001,
  "pid": 18444,
  "hostname": "192.168.1.4"
}
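
Reading that value back while still accepting the legacy format might look like the following sketch. The function name `extract_leader_instance_id` is hypothetical; the only facts taken from this PR are that the new value is a JSON object with an instance_id field and that the legacy value is a bare UUID string.

```python
import json
from typing import Optional, Union


def extract_leader_instance_id(raw_value: Optional[Union[str, bytes]]) -> Optional[str]:
    """Return the leader's instance_id from either the JSON or the legacy UUID-only value."""
    if not raw_value:
        return None  # no leader key present
    if isinstance(raw_value, bytes):
        raw_value = raw_value.decode("utf-8")  # Redis clients may return bytes
    try:
        parsed = json.loads(raw_value)
    except json.JSONDecodeError:
        return raw_value  # legacy format: the value is the UUID itself
    if isinstance(parsed, dict):
        return parsed.get("instance_id")
    return raw_value  # valid JSON but not an object; treat it as an opaque ID
```

A bare UUID is not valid JSON, so the decode error is what distinguishes the two formats and no version marker is needed.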

Design Decisions

Async-only approach: Removed the unused synchronous is_leader_sync() method to avoid event loop conflicts in the FastAPI context; is_leader() is async only.
JSON metadata format: Provides structured data (instance_id, port, PID, hostname) for better observability and debugging.
Backward compatibility: Handles both the JSON and the legacy UUID-only formats during the transition.
Fail-safe defaults: is_leader() returns False when Redis is unavailable, and /health reports is_leader: false if the check raises, so the health check is never blocked.
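
These decisions can be illustrated with a standalone async sketch. It is not the method from gateway_service.py: the free function, the `FakeRedis` stub, and passing the client as a parameter are assumptions for illustration, while the key name comes from the redis-cli example in this PR.

```python
import asyncio
import json
from typing import Any, Optional

LEADER_KEY = "gateway_service_leader"  # key name taken from the redis-cli example


async def is_leader(redis_client: Optional[Any], instance_id: str) -> bool:
    """Return True only when Redis confirms this instance holds the leader key."""
    if redis_client is None:
        return False  # fail-safe: without Redis, never claim leadership
    raw_value = await redis_client.get(LEADER_KEY)  # client errors propagate in this sketch
    if not raw_value:
        return False  # no leader elected yet
    if isinstance(raw_value, bytes):
        raw_value = raw_value.decode("utf-8")
    try:
        parsed = json.loads(raw_value)
        leader_id = parsed.get("instance_id") if isinstance(parsed, dict) else raw_value
    except json.JSONDecodeError:
        leader_id = raw_value  # legacy UUID-only value
    return leader_id == instance_id


class FakeRedis:
    """Hypothetical stand-in for an async Redis client exposing only get()."""

    def __init__(self, value: Optional[str]) -> None:
        self._value = value

    async def get(self, key: str) -> Optional[str]:
        return self._value


json_result = asyncio.run(is_leader(FakeRedis('{"instance_id": "a", "port": 8001}'), "a"))
legacy_result = asyncio.run(is_leader(FakeRedis("a"), "a"))
no_redis_result = asyncio.run(is_leader(None, "a"))
```

Returning False when the client is missing means a partitioned instance reports itself as a follower rather than producing a second apparent leader.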

Tests Added (9 total)

tests/unit/mcpgateway/test_main.py

  1. test_health_check_leader_exception - Health check handles is_leader() exception, returns is_leader: false

tests/unit/mcpgateway/services/test_gateway_service_redis_leadership.py

TestGatewayServiceInitialization (3 tests):
2. test_init_with_invalid_port - Invalid port string defaults to 0
3. test_init_with_none_port - None port defaults to 0
4. test_init_with_valid_port - Valid port string converts to int

TestIsLeaderMethod (5 tests):
5. test_is_leader_with_json_metadata - Parses JSON metadata from Redis
6. test_is_leader_with_legacy_uuid - Handles legacy UUID-only format
7. test_is_leader_returns_false_when_not_leader - Returns False for non-leader
8. test_is_leader_returns_false_when_no_leader - Returns False when Redis has no leader
9. test_is_leader_returns_false_when_redis_unavailable - Returns False when Redis unavailable (bug fix)

Result: 15,833 tests passing, all coverage gaps addressed

Bogdan-Marius-Catanus added 3 commits April 1, 2026 16:28
- Add instance metadata (port, PID, hostname) to Redis leader election
- Expose is_leader boolean field in /health endpoint for cluster monitoring
- Store JSON metadata in Redis instead of plain UUID for operational visibility
- Add startup logging of instance metadata
- Make /health endpoint async to support Redis leader status checks
- Update tests to async for health check endpoints
- Maintain backward compatibility with legacy UUID-only format

Closes #3838

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
- Store instance metadata (port, PID, hostname, instance_id) in Redis as JSON
- Add is_leader boolean field to /health endpoint for monitoring
- Make /health endpoint async to properly check Redis leadership status
- Update tests to verify JSON metadata format and async health checks
- Maintain backward compatibility with legacy UUID-only format
- Remove unused is_leader_sync() method to avoid event loop conflicts

Closes #3838

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@bogdanmariusc10 bogdanmariusc10 added this to the Release 1.1.0 milestone Apr 1, 2026
@bogdanmariusc10 bogdanmariusc10 added the enhancement New feature or request label Apr 1, 2026
@bogdanmariusc10 bogdanmariusc10 added the SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release label Apr 1, 2026
@bogdanmariusc10 bogdanmariusc10 added the api REST API Related item label Apr 1, 2026
- Add test for health check exception handling when is_leader() fails
- Add tests for GatewayService initialization with invalid/None port values
- Add comprehensive tests for is_leader() method covering JSON metadata, legacy UUID, and edge cases
- Fix is_leader() to return False when Redis is unavailable (was incorrectly returning True)
- All tests passing: 15,833 passed (9 new tests added)

Addresses coverage gaps identified in diff-cover report for lines:
- mcpgateway/main.py: 10346, 10348 (exception handling)
- mcpgateway/services/gateway_service.py: 478-479 (port conversion), 4133-4152 (is_leader logic)

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@bogdanmariusc10 bogdanmariusc10 force-pushed the 3838-featureapi-enhancing-multi-instance-monitoring-with-leader-status-indicators branch from d2bd1b6 to d38d030 on April 1, 2026 14:44

Development

Successfully merging this pull request may close these issues.

[FEATURE][API]: Enhancing Multi-Instance Monitoring with Leader Status Indicators

3 participants