Skip to content

fix(gateways): resolve authorization_code OAuth gateway offline issues#5244

Open
bogdanmariusc10 wants to merge 3 commits into
mainfrom
5237-bug-authorization_code-oauth-gateways-become-permanently-offline-due-to-health-check-token-assumptions-and-destructive-refresh-token-handling
Open

fix(gateways): resolve authorization_code OAuth gateway offline issues#5244
bogdanmariusc10 wants to merge 3 commits into
mainfrom
5237-bug-authorization_code-oauth-gateways-become-permanently-offline-due-to-health-check-token-assumptions-and-destructive-refresh-token-handling

Conversation

@bogdanmariusc10

Copy link
Copy Markdown
Collaborator

🔗 Related Issue

Closes #5237


📝 Summary

This PR fixes critical issues where OAuth authorization_code gateways would become permanently offline due to health check assumptions and destructive token handling behavior.

Problem:
authorization_code gateways (where users authorize individually) were failing health checks when no platform admin token existed, causing cascading failures:

  1. Health checks incorrectly failed when no token was available (expected state for user-specific auth)
  2. 401/403 responses were treated as connectivity failures instead of auth challenges
  3. Token refresh failures would permanently delete valid tokens due to overly broad error matching
  4. client_secret decryption failures fell back silently to using ciphertext as credentials
  5. omit_resource flag was ignored during token refresh, breaking IdPs that reject the resource parameter

Solution:

  • Decouple health checks from token ownership - missing tokens don't fail health checks
  • Treat 401/403 responses as "gateway reachable" (auth issue, not connectivity failure)
  • Reactivate previously-unreachable gateways when they respond with 401/403
  • Update last_seen timestamp even on auth failures to reflect gateway responsiveness
  • Make client_secret decryption failures explicit (raise OAuthError) instead of silent fallback
  • Respect omit_resource flag to prevent resource parameter injection when disabled
  • Only delete tokens on RFC 6749 §5.2 invalid_grant errors (permanent failure), not transient errors

🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

Check Command Status
Lint suite make lint ✅ PASS
Unit tests make test ✅ PASS
Coverage ≥ 80% make coverage ✅ PASS

Test Results:

✅ test_oauth_authorization_code_health.py: 8/8 passed
✅ test_gateway_service_401_reactivation.py: 5/5 passed  
✅ test_gateway_service_health_oauth.py: 15/15 passed  
✅ test_token_storage_service.py: 52/52 passed
✅ test_oauth_manager.py: 131/131 passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ TOTAL: 172 OAuth-related tests passing

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated
  • No secrets or credentials committed

📓 Notes

Files Changed

Core Implementation (2 files):

  • mcpgateway/services/gateway_service.py - Health check logic for authorization_code flow

    • Lines 3824-3850: Skip authentication when no token available
    • Lines 3972-4018: Distinguish auth failures (401/403) from connectivity failures
    • Lines 3996-3998: Reactivate previously-unreachable gateways on 401/403
    • Lines 4005-4008: Update last_seen even on auth failures + exception handling
  • mcpgateway/services/token_storage_service.py - Token refresh improvements

    • Lines 298-313: Explicit client_secret decryption with error handling
    • Lines 350-373: omit_resource flag enforcement
    • Lines 439-472: Selective token deletion (only invalid_grant)

Test Updates (4 files):

  • tests/unit/mcpgateway/services/test_gateway_service.py - Added httpx import, updated expectations
  • tests/unit/mcpgateway/services/test_gateway_service_health_oauth.py - OAuth health check test updates
  • tests/unit/mcpgateway/services/test_token_storage_service.py - Token refresh test updates
  • tests/unit/mcpgateway/test_oauth_manager.py - Changed to use OAuthError with invalid_grant

New Test Files (2 files):

  • tests/unit/mcpgateway/services/test_oauth_authorization_code_health.py (8 tests)

    • Test health check behavior documentation
    • Test client_secret decryption failure handling
    • Test omit_resource flag prevention
    • Test resource injection when omit_resource=false
    • Test invalid_grant token deletion
    • Test invalid_client token preservation
    • Test transient error token preservation
  • tests/unit/mcpgateway/services/test_gateway_service_401_reactivation.py (5 tests)

    • Test 401 reactivates previously unreachable gateway (coverage lines 3996-3998)
    • Test 403 reactivates previously unreachable gateway
    • Test 401 last_seen update failure handling (coverage lines 4007-4008)
    • Test 401 on already-reachable gateway (skip reactivation, update last_seen)
    • Test 401 on disabled gateway (skip reactivation, update last_seen)

Technical Details

Health Check Behavior:

  • authorization_code gateways no longer depend on platform admin token availability
  • 401/403 responses now treated as "gateway is reachable but unauthorized" (healthy state)
  • Previously-unreachable gateways are reactivated when they respond with auth challenges
  • Gateway last_seen timestamp updated on all responses (including auth failures)

Token Deletion Logic (RFC 6749 Compliance):

# OLD (bug): Any error with "invalid" or "expired" deleted tokens
if "invalid" in error_str or "expired" in error_str:
    delete_token()

# NEW (fixed): Only RFC 6749 §5.2 invalid_grant deletes tokens
except OAuthError as e:
    if "invalid_grant" in str(e).lower():
        delete_token()  # Permanent failure - refresh token revoked
    else:
        preserve_token()  # Configuration error or transient failure
except Exception:
    preserve_token()  # Network errors, not token issues

client_secret Handling:

# OLD (bug): Silent fallback on decryption failure
try:
    decrypt(client_secret)
except:
    pass  # Uses encrypted value as credential!

# NEW (fixed): Explicit error on decryption failure
if is_encrypted(client_secret):
    try:
        decrypt(client_secret)
    except Exception as e:
        raise OAuthError(f"Failed to decrypt client_secret: {e}")

omit_resource Flag:

# NEW: Check flag BEFORE resource injection
omit_resource = oauth_config.get("omit_resource", False)
if omit_resource:
    oauth_config.pop("resource", None)
    logger.debug("Omitting resource parameter per omit_resource=true")
else:
    # Only inject resource when flag is false/absent
    if gateway.url:
        oauth_config["resource"] = normalize_resource(gateway.url)

Impact

Before (broken):

  • authorization_code gateways marked unhealthy when no platform admin token exists
  • 401/403 responses trigger _handle_gateway_failure() → gateway goes offline
  • Any OAuth error matching "invalid" or "expired" permanently deletes tokens
  • Decryption failures use ciphertext as client_secret → authentication always fails
  • omit_resource flag ignored → breaks IdPs that reject resource parameter

After (fixed):

  • Health checks verify service reachability independent of token ownership
  • 401/403 responses treated as "gateway reachable" → stays online
  • Only invalid_grant deletes tokens (per RFC 6749) → transient errors don't lose tokens
  • Decryption failures raise explicit errors → clear diagnosis of configuration issues
  • omit_resource flag respected → works with all IdP configurations

Testing Strategy

  1. Unit tests for individual components (token refresh, decryption, resource handling)
  2. Integration-style tests for health check paths (401 reactivation, last_seen updates)
  3. Regression tests to ensure existing OAuth functionality still works

Fix bug #5237 where authorization_code OAuth gateways would become
permanently offline due to health check assumptions and destructive
token handling.

Changes:
- Decouple health checks from token ownership for authorization_code flow
- Treat 401/403 responses as 'gateway reachable' instead of failures
- Make client_secret decryption failures explicit (raise OAuthError)
- Respect omit_resource flag during token refresh
- Only delete tokens on invalid_grant errors (RFC 6749 §5.2)
- Add gateway reactivation on 401/403 when previously unreachable
- Update last_seen timestamp even on auth failures

Tests:
- Added test_oauth_authorization_code_health.py (8 tests)
- Added test_gateway_service_401_reactivation.py (5 tests)
- Updated existing OAuth tests to match new behavior
- All 172 OAuth-related tests passing with no regressions

Closes #5237

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@bogdanmariusc10 bogdanmariusc10 added bug Something isn't working MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe labels Jun 16, 2026
@bogdanmariusc10 bogdanmariusc10 added the api REST API Related item label Jun 16, 2026
bogdanmariusc10 and others added 2 commits June 16, 2026 13:58
…ecome-permanently-offline-due-to-health-check-token-assumptions-and-destructive-refresh-token-handling
Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@omprak

omprak commented Jun 26, 2026

Copy link
Copy Markdown

@ja8zyjits @bogdanmariusc10 Our production onboarding is blocked due to this critical issue . Assuming it is blocking every Oauth based MCP server any plan to release it as HotFix tag ?

Note - if you any have workaround solutions and approach IBM is doing for Oauth based MCP servers without this bug fix please let us know .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api REST API Related item bug Something isn't working MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: authorization_code OAuth gateways become permanently offline due to health-check token assumptions and destructive refresh-token handling

4 participants