| 1 | +# Health Checks |
| 2 | + |
| 3 | +The Nexus Broker continuously monitors integration health across two dimensions: **provider-level** (is the upstream API alive?) and **connection-level** (is this user's credential still valid?). Both run as background workers inside the broker process. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Background Workers |
| 8 | + |
| 9 | +### HealthWorker — Provider-Level (5-minute interval) |
| 10 | + |
| 11 | +Probes every registered OAuth2 provider by sending a synthetic token request with a deliberately invalid grant to its `token_url`. Because the request can never succeed, the way the provider rejects it shows whether the API is reachable and responding to OAuth traffic, without requiring a real user credential. |
| 12 | + |
| 13 | +| Provider Response | Status Set | |
| 14 | +|-------------------|------------| |
| 15 | +| `400 Bad Request` or `401 Unauthorized` | `healthy` — API is alive and rejecting correctly | |
| 16 | +| `5xx Server Error` | `unhealthy` — API is down | |
| 17 | +| `200 OK` (unexpected for invalid grant) | `degraded` — API behaving abnormally | |
| 18 | +| Network error / timeout | `unhealthy` | |
| 19 | +| No `token_url` configured | `unknown` | |
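The mapping in the table above can be sketched as a single pure function. This is a minimal illustration, not the broker's actual code: `classifyProbe` and its parameters are hypothetical names, and the fallback for status codes the table does not list is an assumption.

```go
package main

import "net/http"

// Status strings taken from the table above.
const (
	StatusHealthy   = "healthy"
	StatusUnhealthy = "unhealthy"
	StatusDegraded  = "degraded"
	StatusUnknown   = "unknown"
)

// classifyProbe maps the outcome of the synthetic invalid-grant request to a
// provider health status. The function name and signature are hypothetical.
func classifyProbe(tokenURL string, statusCode int, err error) string {
	switch {
	case tokenURL == "":
		// Nothing to probe.
		return StatusUnknown
	case err != nil:
		// Network error or timeout.
		return StatusUnhealthy
	case statusCode == http.StatusBadRequest || statusCode == http.StatusUnauthorized:
		// The provider is alive and rejecting the bad grant as expected.
		return StatusHealthy
	case statusCode >= 500 && statusCode <= 599:
		return StatusUnhealthy
	case statusCode == http.StatusOK:
		// Accepting an invalid grant is abnormal.
		return StatusDegraded
	default:
		// Assumption: the table does not cover other codes (3xx, 403, 404, ...),
		// so this sketch reports them as unknown.
		return StatusUnknown
	}
}
```

Keeping the classification separate from the HTTP call makes each row of the table easy to check in a unit test.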
| 20 | + |
| 21 | +For non-OAuth2 providers (API key, basic auth), the worker makes a `HEAD` request to `user_info_endpoint` or `api_base_url`. Any non-5xx response is treated as `healthy`. |
| 22 | + |
| 23 | +**Concurrency:** max 10 providers checked concurrently (semaphore + WaitGroup). |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +### ConnectionHealthWorker — Connection-Level (1-minute interval) |
| 28 | + |
| 29 | +Validates individual user connections in batches of 100 on a fixed ticker, prioritising those never checked or longest overdue. Each check has a 15-second timeout. A shared `http.Client` is reused across checks for connection pooling. |
| 30 | + |
| 31 | +| Auth Type | Check Method | |
| 32 | +|-----------|-------------| |
| 33 | +| `oauth2` | Attempt a background token refresh via `ConnectionService.Refresh` | |
| 34 | +| `api_key` | Decrypt credential, extract `api_key` field, `GET` to `user_info_endpoint` using provider's configured `AuthHeader` | |
| 35 | +| `basic_auth` | Decrypt credential, extract `username`/`password`, `GET` to `user_info_endpoint` with `Authorization: Basic` | |
| 36 | +| No endpoint configured | Mark `unknown` | |
| 37 | + |
| 38 | +**OAuth2 status code handling:** The worker inspects `RefreshResponse.StatusCode` to distinguish definitive credential errors from transient failures: |
| 39 | + |
| 40 | +| Upstream Status | `health_status` set | `connection.status` changed? | |
| 41 | +|-----------------|--------------------|-----------------------------| |
| 42 | +| Refresh succeeds | `healthy` | No | |
| 43 | +| 400 / 401 (invalid_grant, revoked) | `expired`, or `unhealthy` when provider-shielded | Yes → `expired` (only if provider healthy) | |
| 44 | +| 403 (scope issue) | `degraded` | No | |
| 45 | +| 5xx (upstream error) | `unhealthy` | No | |
| 46 | +| Network error / nil response | `degraded` | No | |
| 47 | + |
| 48 | +**Provider shielding:** Before expiring a connection, the worker cross-references the upstream provider's `health_status`. If the provider is `unhealthy` or `degraded`, the connection is marked `unhealthy` (retriable) instead of `expired` (terminal). This prevents mass-expiration during transient upstream outages. |
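The status table and the shielding rule combine into one decision, sketched below. Everything here is illustrative: `classifyRefresh` and `refreshOutcome` are hypothetical names, a `statusCode` of 0 stands in for "network error / nil response", and the handling of codes the table does not list is an assumption.

```go
package main

// refreshOutcome is a hypothetical result type for this sketch.
type refreshOutcome struct {
	HealthStatus     string // value to write to connections.health_status
	ExpireConnection bool   // whether connection.status should become "expired"
}

// classifyRefresh applies the OAuth2 status table plus provider shielding.
// statusCode 0 represents a network error or nil response.
func classifyRefresh(succeeded bool, statusCode int, providerHealth string) refreshOutcome {
	switch {
	case succeeded:
		return refreshOutcome{HealthStatus: "healthy"}
	case statusCode == 0:
		return refreshOutcome{HealthStatus: "degraded"}
	case statusCode == 400 || statusCode == 401:
		// Provider shielding: do not expire credentials while the provider
		// itself looks unwell; mark the connection retriable instead.
		if providerHealth == "unhealthy" || providerHealth == "degraded" {
			return refreshOutcome{HealthStatus: "unhealthy"}
		}
		return refreshOutcome{HealthStatus: "expired", ExpireConnection: true}
	case statusCode == 403:
		return refreshOutcome{HealthStatus: "degraded"}
	case statusCode >= 500 && statusCode <= 599:
		return refreshOutcome{HealthStatus: "unhealthy"}
	default:
		// Assumption: other codes leave credential validity unknown, which
		// this sketch reports as degraded.
		return refreshOutcome{HealthStatus: "degraded"}
	}
}
```

Note that only the 400/401 branch can change `connection.status`, and only when shielding does not apply. A provider whose status is `unknown` is not shielded in this sketch, following the wording of the shielding rule.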
| 49 | + |
| 50 | +**Error handling:** If `UpdateStatus` fails when expiring a connection, the worker logs the error and skips the `health_status` write to avoid leaving the connection in an inconsistent state. |
| 51 | + |
| 52 | +**Concurrency:** max 20 connections checked concurrently (semaphore + WaitGroup). |
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## `health_status` Values |
| 57 | + |
| 58 | +Both `provider_profiles` and `connections` share the same status vocabulary: |
| 59 | + |
| 60 | +| Value | Meaning | |
| 61 | +|-------|---------| |
| 62 | +| `healthy` | Last check passed | |
| 63 | +| `unhealthy` | Last check failed — retriable (transient upstream or provider-shielded) | |
| 64 | +| `degraded` | Partial failure — scope issues, network errors, or internal errors where credential validity is unknown | |
| 65 | +| `expired` | Credential confirmed invalid (400/401) — user must re-authenticate | |
| 66 | +| `unknown` | Not yet checked, or not enough information to check | |
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## API Endpoints |
| 71 | + |
| 72 | +### `GET /providers/health` |
| 73 | +Returns the health status of all registered providers. No credentials are included. |
| 74 | + |
| 75 | +```http |
| 76 | +GET /providers/health |
| 77 | +Authorization: X-API-Key <key> |
| 78 | +``` |
| 79 | + |
| 80 | +```json |
| 81 | +[ |
| 82 | + { |
| 83 | + "id": "uuid", |
| 84 | + "name": "google", |
| 85 | + "health_status": "healthy", |
| 86 | + "last_health_check_at": "2026-05-19T07:00:00Z", |
| 87 | + "health_message": "" |
| 88 | + }, |
| 89 | + { |
| 90 | + "id": "uuid", |
| 91 | + "name": "stripe", |
| 92 | + "health_status": "unhealthy", |
| 93 | + "last_health_check_at": "2026-05-19T07:05:00Z", |
| 94 | + "health_message": "upstream returned 503" |
| 95 | + } |
| 96 | +] |
| 97 | +``` |
| 98 | + |
| 99 | +Returns `[]` (not `null`) when no providers exist. |
| 100 | + |
| 101 | +--- |
| 102 | + |
| 103 | +### `GET /connections?workspace_id={workspace_id}` |
| 104 | +Returns all non-pending connections for a workspace with health status. No credentials or tokens are included. |
| 105 | + |
| 106 | +```http |
| 107 | +GET /connections?workspace_id=ws-123 |
| 108 | +Authorization: X-API-Key <key> |
| 109 | +``` |
| 110 | + |
| 111 | +```json |
| 112 | +[ |
| 113 | + { |
| 114 | + "id": "uuid", |
| 115 | + "provider_id": "uuid", |
| 116 | + "provider_name": "google", |
| 117 | + "auth_type": "oauth2", |
| 118 | + "status": "active", |
| 119 | + "scopes": ["email", "calendar.read"], |
| 120 | + "health_status": "healthy", |
| 121 | + "last_health_check_at": "2026-05-19T07:00:00Z", |
| 122 | + "created_at": "2026-05-01T00:00:00Z", |
| 123 | + "updated_at": "2026-05-19T07:00:00Z" |
| 124 | + } |
| 125 | +] |
| 126 | +``` |
| 127 | + |
| 128 | +**Use case:** Rendering a connections dashboard with live health indicators. |
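A dashboard backend might consume this response along the following lines. The struct covers only a subset of the fields shown in the example payload, and `needsReauth` is a hypothetical helper; fetching the response body is left to the caller.

```go
package main

import "encoding/json"

// connectionHealth mirrors a subset of the fields in the example response.
type connectionHealth struct {
	ID           string `json:"id"`
	ProviderName string `json:"provider_name"`
	Status       string `json:"status"`
	HealthStatus string `json:"health_status"`
}

// needsReauth decodes a /connections response body and returns the IDs of
// connections whose credential is reported as expired. Hypothetical helper.
func needsReauth(body []byte) ([]string, error) {
	var conns []connectionHealth
	if err := json.Unmarshal(body, &conns); err != nil {
		return nil, err
	}
	var ids []string
	for _, c := range conns {
		if c.HealthStatus == "expired" || c.Status == "expired" {
			ids = append(ids, c.ID)
		}
	}
	return ids, nil
}
```

The returned IDs could drive a re-authentication prompt, while `unhealthy` and `degraded` connections are better shown as warnings, since the worker may still recover them on a later check.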
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +### `GET /connections/{id}/token` (enhanced) |
| 133 | +The existing token endpoint now includes `health_status` in its response alongside credentials and strategy. |
| 134 | + |
| 135 | +```json |
| 136 | +{ |
| 137 | + "strategy": { "type": "oauth2" }, |
| 138 | + "credentials": { "access_token": "..." }, |
| 139 | + "health_status": "healthy" |
| 140 | +} |
| 141 | +``` |
| 142 | + |
| 143 | +**Use case:** Showing an inline warning or re-auth prompt when consuming a credential. |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Worker Mode |
| 148 | + |
| 149 | +Health workers run inside the standard broker process. For deployments that need to separate HTTP serving from background polling, pass `--worker-only` to the binary: |
| 150 | + |
| 151 | +```bash |
| 152 | +nexus-broker --worker-only |
| 153 | +``` |
| 154 | + |
| 155 | +In this mode the HTTP server does not start. On `SIGINT`/`SIGTERM` the process cancels the worker context, which signals in-flight checks to stop. Note: the current implementation does not explicitly wait for worker goroutines to finish before exiting, so a check that is still running at shutdown may be cut off. |
| 156 | + |
| 157 | +The same Docker image and environment variables are used — just override the container command. |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +## Database Schema |
| 162 | + |
| 163 | +```sql |
| 164 | +-- provider_profiles |
| 165 | +ALTER TABLE provider_profiles |
| 166 | + ADD COLUMN last_health_check_at TIMESTAMP WITH TIME ZONE, |
| 167 | + ADD COLUMN health_status VARCHAR(50) DEFAULT 'unknown', |
| 168 | + ADD COLUMN health_message TEXT; |
| 169 | + |
| 170 | +-- connections |
| 171 | +ALTER TABLE connections |
| 172 | + ADD COLUMN last_health_check_at TIMESTAMP WITH TIME ZONE, |
| 173 | + ADD COLUMN health_status VARCHAR(50) DEFAULT 'unknown'; |
| 174 | + |
| 175 | +-- Performance index for GetForHealthCheck query |
| 176 | +CREATE INDEX IF NOT EXISTS idx_connections_health_check |
| 177 | + ON connections (status, last_health_check_at ASC NULLS FIRST) |
| 178 | + WHERE status = 'active'; |
| 179 | +``` |
| 180 | + |
| 181 | +Migrations: `13_add_provider_health.sql`, `14_add_connection_health.sql`, `15_add_connection_health_index.sql`. |