| 1 | +# Prometheus Metrics |
| 2 | + |
| 3 | +This application exposes Prometheus-compatible metrics on a separate port from the main API server. |
| 4 | + |
| 5 | +## Configuration |
| 6 | + |
| 7 | +The metrics server runs on a separate port configured via the `METRICS_PORT` environment variable: |
| 8 | + |
| 9 | +```bash |
| 10 | +# Default: 9090 |
| 11 | +METRICS_PORT=9090 |
| 12 | +``` |
| 13 | + |
| 14 | +Add this to your `.env` file. See `.env.sample` for reference. |
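
As an illustration of the setting above, here is a minimal sketch of how a port value could be resolved from the environment. The helper name `resolveMetricsPort` and its validation rules are assumptions made for this example; they are not taken from the application's source.

```typescript
// Hypothetical helper: the name and the validation rules are assumptions.
const DEFAULT_METRICS_PORT = 9090;

function resolveMetricsPort(env: Record<string, string | undefined>): number {
  const raw = env.METRICS_PORT;

  // Fall back to the documented default when the variable is unset or empty.
  if (raw === undefined || raw.trim() === '') {
    return DEFAULT_METRICS_PORT;
  }

  const port = Number(raw);

  // Accept only whole numbers in the valid TCP port range.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid METRICS_PORT: ${raw}`);
  }

  return port;
}
```

In a Node.js process this would typically be called as `resolveMetricsPort(process.env)`. Failing fast on a malformed value is one possible design choice; silently falling back to the default is another.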
| 15 | + |
| 16 | +## Metrics Endpoint |
| 17 | + |
| 18 | +The metrics are served at: |
| 19 | + |
| 20 | +``` |
| 21 | +http://localhost:9090/metrics |
| 22 | +``` |
| 23 | + |
| 24 | +Replace `9090` with your configured `METRICS_PORT` if it differs. |
| 25 | + |
| 26 | +## Available Metrics |
| 27 | + |
| 28 | +### Default Node.js Metrics |
| 29 | + |
| 30 | +The following default Node.js metrics are automatically collected: |
| 31 | + |
| 32 | +- **nodejs_version_info** - Node.js version information |
| 33 | +- **process_cpu_user_seconds_total** - Total user CPU time spent in seconds |
| 34 | +- **process_cpu_system_seconds_total** - Total system CPU time spent in seconds |
| 35 | +- **nodejs_heap_size_total_bytes** - Total heap size in bytes |
| 36 | +- **nodejs_heap_size_used_bytes** - Used heap size in bytes |
| 37 | +- **nodejs_external_memory_bytes** - External memory in bytes |
| 38 | +- **nodejs_heap_space_size_total_bytes** - Total heap space size in bytes |
| 39 | +- **nodejs_heap_space_size_used_bytes** - Used heap space size in bytes |
| 40 | +- **nodejs_eventloop_lag_seconds** - Event loop lag in seconds |
| 41 | +- **nodejs_eventloop_lag_min_seconds** - Minimum event loop lag |
| 42 | +- **nodejs_eventloop_lag_max_seconds** - Maximum event loop lag |
| 43 | +- **nodejs_eventloop_lag_mean_seconds** - Mean event loop lag |
| 44 | +- **nodejs_eventloop_lag_stddev_seconds** - Standard deviation of event loop lag |
| 45 | +- **nodejs_eventloop_lag_p50_seconds** - 50th percentile event loop lag |
| 46 | +- **nodejs_eventloop_lag_p90_seconds** - 90th percentile event loop lag |
| 47 | +- **nodejs_eventloop_lag_p99_seconds** - 99th percentile event loop lag |
| 48 | + |
| 49 | +### Custom HTTP Metrics |
| 50 | + |
| 51 | +#### http_request_duration_seconds (Histogram) |
| 52 | + |
| 53 | +Duration of HTTP requests in seconds, labeled by: |
| 54 | +- `method` - HTTP method (GET, POST, etc.) |
| 55 | +- `route` - Request route/path |
| 56 | +- `status_code` - HTTP status code |
| 57 | + |
| 58 | +Buckets: 0.01, 0.05, 0.1, 0.5, 1, 5, 10 seconds |
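
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound (`le`), and an implicit `+Inf` bucket counts all observations. The sketch below illustrates that behavior with the bucket bounds listed above. The function name `cumulativeBucketCounts` is hypothetical, and this is not how `prom-client` stores data internally.

```typescript
// Bucket upper bounds, in seconds, as documented for this histogram.
const HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 5, 10];

// Hypothetical helper: returns the cumulative count for each upper bound,
// followed by a final entry for the implicit +Inf bucket.
function cumulativeBucketCounts(
  observations: number[],
  bounds: number[],
): Array<{ le: number; count: number }> {
  const upperBounds = [...bounds, Infinity];

  return upperBounds.map((le) => ({
    le,
    count: observations.filter((value) => value <= le).length,
  }));
}

// For observations [0.004, 0.03, 0.2, 0.7, 12] the counts are
// 1, 2, 2, 3, 4, 4, 4, 5: the 12-second request only lands in +Inf.
```

Requests slower than the largest bound (10 seconds) are only visible in the `+Inf` bucket, so their exact duration cannot be recovered from the histogram.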
| 59 | + |
| 60 | +#### http_requests_total (Counter) |
| 61 | + |
| 62 | +Total number of HTTP requests, labeled by: |
| 63 | +- `method` - HTTP method (GET, POST, etc.) |
| 64 | +- `route` - Request route/path |
| 65 | +- `status_code` - HTTP status code |
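
Each distinct combination of label values becomes its own time series in Prometheus. The sketch below models that with a plain in-memory map. `LabeledCounter` is a hypothetical stand-in written for this example, not the `prom-client` `Counter` class, and the routes used in the usage note are made up.

```typescript
type Labels = Record<string, string>;

// Hypothetical in-memory model of a labeled counter.
class LabeledCounter {
  private readonly series = new Map<string, number>();

  // Sort label names so that label order does not create separate series.
  private keyFor(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((name) => `${name}="${labels[name]}"`)
      .join(',');
  }

  inc(labels: Labels): void {
    const key = this.keyFor(labels);
    this.series.set(key, (this.series.get(key) ?? 0) + 1);
  }

  get(labels: Labels): number {
    return this.series.get(this.keyFor(labels)) ?? 0;
  }

  // Number of distinct label combinations seen so far.
  seriesCount(): number {
    return this.series.size;
  }
}
```

Incrementing with `{ method: 'GET', route: '/health', status_code: '200' }` twice and `{ method: 'POST', route: '/graphql', status_code: '500' }` once produces two series. This is why raw paths containing IDs are usually a poor choice for the `route` label: every new ID would add another series.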
| 66 | + |
| 67 | +### GraphQL Metrics |
| 68 | + |
| 69 | +#### hawk_gql_operation_duration_seconds (Histogram) |
| 70 | + |
| 71 | +Histogram of total GraphQL operation duration by operation name and type. |
| 72 | + |
| 73 | +Labels: |
| 74 | +- `operation_name` - Name of the GraphQL operation |
| 75 | +- `operation_type` - Type of operation (query, mutation, subscription) |
| 76 | + |
| 77 | +Buckets: 0.01, 0.05, 0.1, 0.5, 1, 5, 10 seconds |
| 78 | + |
| 79 | +**Purpose**: Identify slow API operations (P95/P99 latency). |
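
Percentiles such as P95 and P99 are not stored directly; they are estimated from the cumulative bucket counts, typically with PromQL's `histogram_quantile`, which interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified version of that idea for intuition only. The function name `estimateQuantile` and the `Bucket` shape are assumptions, and it skips edge cases that Prometheus handles.

```typescript
// Cumulative bucket: `count` observations were <= `le` seconds.
// The last bucket is expected to have le = Infinity.
interface Bucket {
  le: number;
  count: number;
}

// Hypothetical helper: simplified linear interpolation over buckets.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q <= 0 || q > 1 || buckets.length < 2) {
    throw new Error('q must be in (0, 1] and at least two buckets are required');
  }

  const total = buckets[buckets.length - 1].count;
  if (total === 0) {
    return NaN;
  }

  const rank = q * total;
  // First bucket whose cumulative count reaches the requested rank.
  const index = buckets.findIndex((bucket) => bucket.count >= rank);
  const bucket = buckets[index];

  // A rank inside +Inf cannot be interpolated; report the highest finite bound.
  if (bucket.le === Infinity) {
    return buckets[index - 1].le;
  }

  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const countBelow = index === 0 ? 0 : buckets[index - 1].count;
  const fraction = (rank - countBelow) / (bucket.count - countBelow);

  return lowerBound + (bucket.le - lowerBound) * fraction;
}
```

Because the estimate is interpolated, its precision depends on the bucket layout: with the bounds above, any P99 that falls between 1 and 5 seconds is resolved only within that 4-second-wide bucket.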
| 80 | + |
| 81 | +#### hawk_gql_operation_errors_total (Counter) |
| 82 | + |
| 83 | +Counter of failed GraphQL operations grouped by operation name and error class. |
| 84 | + |
| 85 | +Labels: |
| 86 | +- `operation_name` - Name of the GraphQL operation |
| 87 | +- `error_type` - Type/class of the error |
| 88 | + |
| 89 | +**Purpose**: Detect increased error rates and failing operations. |
| 90 | + |
| 91 | +#### hawk_gql_resolver_duration_seconds (Histogram) |
| 92 | + |
| 93 | +Histogram of resolver execution time per type, field, and operation. |
| 94 | + |
| 95 | +Labels: |
| 96 | +- `type_name` - GraphQL type name |
| 97 | +- `field_name` - Field name being resolved |
| 98 | +- `operation_name` - Name of the GraphQL operation |
| 99 | + |
| 100 | +Buckets: 0.01, 0.05, 0.1, 0.5, 1, 5 seconds |
| 101 | + |
| 102 | +**Purpose**: Find slow or CPU-intensive resolvers that degrade overall performance. |
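
One way to collect per-resolver timings is to wrap each resolver so that its duration is reported whether it succeeds or throws. The sketch below shows that pattern without any GraphQL dependency. The name `withResolverTiming`, the `observe` callback, and the injectable `now` clock are assumptions made for this example; the actual plugin in `src/metrics/graphql.ts` may work differently.

```typescript
interface ResolverLabels {
  type_name: string;
  field_name: string;
  operation_name: string;
}

// Hypothetical wrapper: reports the resolver's duration in seconds via `observe`.
function withResolverTiming<Args extends unknown[], R>(
  labels: ResolverLabels,
  resolve: (...args: Args) => R | Promise<R>,
  observe: (labels: ResolverLabels, seconds: number) => void,
  now: () => number = Date.now,
): (...args: Args) => Promise<R> {
  return async (...args: Args): Promise<R> => {
    const startedAt = now();
    try {
      // Awaiting covers both synchronous and promise-returning resolvers.
      return await resolve(...args);
    } finally {
      // Runs on success and on failure, so failed resolvers are timed too.
      observe(labels, (now() - startedAt) / 1000);
    }
  };
}
```

Note that this wrapper always returns a promise, even for synchronous resolvers, which keeps the sketch short at the cost of a small overhead per field.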
| 103 | + |
| 104 | +### MongoDB Metrics |
| 105 | + |
| 106 | +#### hawk_mongo_command_duration_seconds (Histogram) |
| 107 | + |
| 108 | +Histogram of MongoDB command duration by command, collection family, and database. |
| 109 | + |
| 110 | +Labels: |
| 111 | +- `command` - MongoDB command name (find, insert, update, etc.) |
| 112 | +- `collection_family` - Collection family name (extracted from dynamic collection names to reduce cardinality) |
| 113 | +- `db` - Database name |
| 114 | + |
| 115 | +Buckets: 0.01, 0.05, 0.1, 0.5, 1, 5, 10 seconds |
| 116 | + |
| 117 | +**Purpose**: Detect slow queries and high-latency collections. |
| 118 | + |
| 119 | +**Note on Collection Families**: To reduce metric cardinality, dynamic collection names are grouped into families. For example: |
| 120 | +- `events:projectId` → `events` |
| 121 | +- `dailyEvents:projectId` → `dailyEvents` |
| 122 | +- `repetitions:projectId` → `repetitions` |
| 123 | +- `membership:userId` → `membership` |
| 124 | +- `team:workspaceId` → `team` |
| 125 | + |
| 126 | +Grouping keeps the number of time series bounded when there are thousands of projects, users, or workspaces, while still showing how each kind of collection performs. |
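
The mapping in the examples above can be sketched as taking everything before the first `:` in the collection name. The function name `extractCollectionFamily` and the handling of names without a separator are assumptions; the real logic in `src/metrics/mongodb.ts` may differ.

```typescript
// Hypothetical helper: derives the collection family from a collection name.
function extractCollectionFamily(collectionName: string): string {
  const separatorIndex = collectionName.indexOf(':');

  // Names without a dynamic suffix are assumed to be their own family.
  return separatorIndex === -1
    ? collectionName
    : collectionName.slice(0, separatorIndex);
}
```

With this sketch, `events:projectId` maps to `events`, and a static name such as `users` is returned unchanged.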
| 127 | + |
| 128 | +#### hawk_mongo_command_errors_total (Counter) |
| 129 | + |
| 130 | +Counter of failed MongoDB commands grouped by command and error code. |
| 131 | + |
| 132 | +Labels: |
| 133 | +- `command` - MongoDB command name |
| 134 | +- `error_code` - MongoDB error code |
| 135 | + |
| 136 | +**Purpose**: Track transient or persistent database errors. |
| 137 | + |
| 138 | +## Testing |
| 139 | + |
| 140 | +### Manual Testing |
| 141 | + |
| 142 | +You can test the metrics endpoint using curl: |
| 143 | + |
| 144 | +```bash |
| 145 | +curl http://localhost:9090/metrics |
| 146 | +``` |
| 147 | + |
| 148 | +Or run the provided test script: |
| 149 | + |
| 150 | +```bash |
| 151 | +./test-metrics.sh |
| 152 | +``` |
| 153 | + |
| 154 | +### Integration Tests |
| 155 | + |
| 156 | +Integration tests for metrics are located in `test/integration/cases/metrics.test.ts`. |
| 157 | + |
| 158 | +Run them with: |
| 159 | + |
| 160 | +```bash |
| 161 | +npm run test:integration |
| 162 | +``` |
| 163 | + |
| 164 | +## Implementation Details |
| 165 | + |
| 166 | +The metrics implementation uses the `prom-client` library and consists of: |
| 167 | + |
| 168 | +1. **Metrics Module** (`src/metrics/index.ts`): |
| 169 | +   - Initializes a Prometheus registry |
| 170 | +   - Configures default Node.js metrics collection |
| 171 | +   - Defines custom HTTP metrics (duration histogram and request counter) |
| 172 | +   - Registers GraphQL and MongoDB metrics |
| 173 | +   - Provides middleware for tracking HTTP requests |
| 174 | +   - Creates a separate Express app for serving metrics |
| 175 | + |
| 176 | +2. **GraphQL Metrics** (`src/metrics/graphql.ts`): |
| 177 | +   - Implements Apollo Server plugin for tracking GraphQL operations |
| 178 | +   - Tracks operation duration, errors, and resolver execution time |
| 179 | +   - Automatically captures operation name, type, and field information |
| 180 | + |
| 181 | +3. **MongoDB Metrics** (`src/metrics/mongodb.ts`): |
| 182 | +   - Implements MongoDB command monitoring |
| 183 | +   - Tracks command duration and errors |
| 184 | +   - Uses MongoDB's command monitoring events |
| 185 | +   - Extracts collection families from dynamic collection names to reduce cardinality |
| 186 | + |
| 187 | +4. **Integration** (`src/index.ts`, `src/mongo.ts`): |
| 188 | +   - Adds GraphQL metrics plugin to Apollo Server |
| 189 | +   - Adds metrics middleware to the main Express app |
| 190 | +   - Enables MongoDB command monitoring on database clients |
| 191 | +   - Starts metrics server on a separate port |
| 192 | +   - Keeps metrics server isolated from main API traffic |
| 193 | + |
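
To make the MongoDB part more concrete: the Node.js driver emits `commandStarted`, `commandSucceeded`, and `commandFailed` events when a client is created with `monitorCommands: true`, and the events for one command share a `requestId`. The sketch below shows how such events could be correlated to produce the labels described earlier. The event interfaces are simplified stand-ins rather than the driver's own types, and `CommandDurationTracker` and the `observe` callback are hypothetical names.

```typescript
// Simplified stand-ins for the driver's command monitoring events.
interface StartedEvent {
  requestId: number;
  commandName: string;
  databaseName: string;
  command: Record<string, unknown>;
}

interface FinishedEvent {
  requestId: number;
  commandName: string;
  duration: number; // milliseconds
}

interface DurationLabels {
  command: string;
  collection_family: string;
  db: string;
}

// Hypothetical tracker: remembers labels at start, reports duration at finish.
class CommandDurationTracker {
  private readonly pending = new Map<number, DurationLabels>();
  private readonly observe: (labels: DurationLabels, seconds: number) => void;

  constructor(observe: (labels: DurationLabels, seconds: number) => void) {
    this.observe = observe;
  }

  onStarted(event: StartedEvent): void {
    // For commands such as find or insert, the collection name is the value
    // stored under the command's own name, e.g. { find: 'events:abc' }.
    const target = event.command[event.commandName];
    const collection = typeof target === 'string' ? target : 'unknown';

    this.pending.set(event.requestId, {
      command: event.commandName,
      collection_family: collection.split(':')[0],
      db: event.databaseName,
    });
  }

  onFinished(event: FinishedEvent): void {
    const labels = this.pending.get(event.requestId);
    if (labels === undefined) {
      return;
    }
    // Remove the entry so the map does not grow without bound.
    this.pending.delete(event.requestId);
    this.observe(labels, event.duration / 1000);
  }
}
```

With a real client, `onStarted` would be attached to `commandStarted` and `onFinished` to both `commandSucceeded` and `commandFailed`, with `observe` forwarding to the duration histogram. Storing labels at start time is needed because only the started event carries the full command document.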
| 194 | +## Prometheus Configuration |
| 195 | + |
| 196 | +To scrape these metrics with Prometheus, add the following to your `prometheus.yml`: |
| 197 | + |
| 198 | +```yaml |
| 199 | +scrape_configs: |
| 200 | +  - job_name: 'hawk-api' |
| 201 | +    static_configs: |
| 202 | +      - targets: ['localhost:9090'] |
| 203 | +``` |
| 204 | + |
| 205 | +Adjust the target host and port according to your deployment. |