|
| 1 | +# Observability and Memory Debugging Runbook |
| 2 | + |
| 3 | +Use this runbook when reproducing Graphile/server memory growth on a local machine, or when you need a quick local view of Graphile build churn and PostgreSQL activity. |
| 4 | + |
| 5 | +## What observability enables |
| 6 | + |
| 7 | +Set `GRAPHQL_OBSERVABILITY_ENABLED=true` to turn on the graphql/server observability surfaces. |
| 8 | + |
| 9 | +When enabled: |
| 10 | + |
| 11 | +- `GET /debug/memory` returns a process snapshot with Node/V8 memory, Graphile cache state, in-flight builds, and Graphile build timings |
| 12 | +- `GET /debug/db` returns PostgreSQL pool usage, active queries, blocked sessions, lock summary, and selected database stats |
| 13 | +- the debug sampler writes periodic NDJSON snapshots for offline analysis |
| 14 | + |
| 15 | +Observability only activates when all of the following are true: |
| 16 | + |
| 17 | +- `GRAPHQL_OBSERVABILITY_ENABLED=true` |
| 18 | +- `NODE_ENV=development` |
| 19 | +- the GraphQL server is bound to a loopback host such as `localhost`, `127.0.0.1`, or `::1` |
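The three conditions above amount to a single guard. A minimal TypeScript sketch, assuming a hypothetical helper named `isObservabilityActive`; the real server's check may be structured differently and may accept other loopback spellings:

```typescript
// Hypothetical helper for illustration; not taken from the server source.
const LOOPBACK_HOSTS = new Set(["localhost", "127.0.0.1", "::1"]);

function isObservabilityActive(
  env: Record<string, string | undefined>,
  host: string,
): boolean {
  // All three conditions must hold; any failure leaves the routes unmounted.
  return (
    env["GRAPHQL_OBSERVABILITY_ENABLED"] === "true" &&
    env["NODE_ENV"] === "development" &&
    LOOPBACK_HOSTS.has(host.toLowerCase())
  );
}
```

Under this sketch, `isObservabilityActive(process.env, "0.0.0.0")` is false even with both variables set, which matches the `404` behavior described for non-loopback hosts.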
| 20 | + |
| 21 | +When those conditions are not met: |
| 22 | + |
| 23 | +- `/debug/memory` returns `404` |
| 24 | +- `/debug/db` returns `404` |
| 25 | +- the sampler does not start |
| 26 | +- no debug log directory is created |
| 27 | + |
| 28 | +This mode is intended for local debugging only. It is not a live operational surface. |
| 29 | + |
| 30 | +## Start the server with profiling enabled |
| 31 | + |
| 32 | +```bash |
| 33 | +cd graphql/server |
| 34 | +NODE_ENV=development \ |
| 35 | +GRAPHQL_OBSERVABILITY_ENABLED=true \ |
| 36 | +NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2 --expose-gc" \ |
| 37 | +GRAPHQL_DEBUG_SAMPLER_ENABLED=true \ |
| 38 | +GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS=10000 \ |
| 39 | +pnpm dev |
| 40 | +``` |
| 41 | + |
| 42 | +`GRAPHQL_OBSERVABILITY_ENABLED` is the master switch. When it is not `true`, the server does not mount `/debug/memory`, does not mount `/debug/db`, and does not start the sampler. |
| 43 | +`NODE_ENV` must also be `development`, and the server host must be loopback-only. |
| 44 | + |
| 45 | +Optional knobs: |
| 46 | + |
| 47 | +- `GRAPHQL_DEBUG_SAMPLER_ENABLED=false` disables the sampler while leaving the debug routes available |
| 48 | +- `GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS=<ms>` changes the sampling interval |
| 49 | +- `GRAPHQL_DEBUG_SAMPLER_DIR=/abs/path` writes the sampler output somewhere other than `graphql/server/logs` |
| 50 | + |
| 51 | +By default, the debug sampler creates one run directory per server process under `graphql/server/logs/`. |
| 52 | + |
| 53 | +Expected files in each run directory: |
| 54 | + |
| 55 | +- `debug-memory.ndjson` |
| 56 | +- `debug-db.ndjson` |
| 57 | +- `debug-sampler-errors.ndjson` |
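NDJSON stores one JSON object per line, so these files can also be inspected without the analyzer. The sketch below assumes each `debug-memory.ndjson` line carries `heapUsedBytes` and `rssBytes` at the top level; the real nesting may differ, and `parseNdjson`, `MemorySample`, and `peakBytes` are illustrative names rather than part of the tooling:

```typescript
// Parse NDJSON text: one JSON document per line, blank lines skipped.
function parseNdjson<T>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

// Assumed sample shape; adjust to the fields actually present in the file.
interface MemorySample {
  heapUsedBytes: number;
  rssBytes: number;
}

// Highest heap and RSS values seen across the run (0 for an empty run).
function peakBytes(samples: MemorySample[]): { heapUsed: number; rss: number } {
  return samples.reduce(
    (peak, s) => ({
      heapUsed: Math.max(peak.heapUsed, s.heapUsedBytes),
      rss: Math.max(peak.rss, s.rssBytes),
    }),
    { heapUsed: 0, rss: 0 },
  );
}
```

To use it, read a run's file into a string (for example with `fs.readFileSync(path, "utf8")`) and pass the result to `parseNdjson<MemorySample>`.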
| 58 | + |
| 59 | +## Debug routes |
| 60 | + |
| 61 | +Use these routes for live inspection while a server is running: |
| 62 | + |
| 63 | +- `GET /debug/memory` |
| 64 | + - process memory usage |
| 65 | + - V8 heap statistics and heap spaces |
| 66 | + - Graphile cache state |
| 67 | + - in-flight handler creation count |
| 68 | + - Graphile build timing aggregates |
| 69 | + - PostGIS codec telemetry |
| 70 | +- `GET /debug/db` |
| 71 | + - PG pool totals, idle count, and waiters |
| 72 | + - active queries and blocked sessions |
| 73 | + - lock summary |
| 74 | + - selected `pg_stat_database` counters |
| 75 | + - `pg_notification_queue_usage()` |
| 76 | + |
| 77 | +## Analyze the latest sampler run |
| 78 | + |
| 79 | +```bash |
| 80 | +cd graphql/server |
| 81 | +pnpm debug:memory:analyze |
| 82 | +``` |
| 83 | + |
| 84 | +To analyze a specific run directory: |
| 85 | + |
| 86 | +```bash |
| 87 | +cd graphql/server |
| 88 | +pnpm debug:memory:analyze -- --dir ./logs/run-2026-03-09T12-00-00-000Z-pid12345 |
| 89 | +``` |
| 90 | + |
| 91 | +## Capture a heap snapshot |
| 92 | + |
| 93 | +```bash |
| 94 | +cd graphql/server |
| 95 | +pnpm debug:heap:capture -- --pid <server-pid> |
| 96 | +``` |
| 97 | + |
| 98 | +If your server writes snapshots somewhere else, pass `--dir`. |
| 99 | + |
| 100 | +## Tooling reference |
| 101 | + |
| 102 | +- `pnpm debug:memory:analyze` |
| 103 | + - reads the latest sampler directory by default |
| 104 | + - summarizes heap/RSS range, Graphile build stats, DB waiters, and blocked sessions |
| 105 | +- `pnpm debug:heap:capture -- --pid <server-pid>` |
| 106 | + - sends `SIGUSR2` to the server process |
| 107 | + - requires `NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2 --expose-gc"` |
| 108 | + - prints the created `.heapsnapshot` path |
| 109 | + |
| 110 | +## Recommended incident workflow |
| 111 | + |
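| 112 | +1. Start the server with `GRAPHQL_OBSERVABILITY_ENABLED=true` and `NODE_ENV=development` on a loopback host. Include `NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2 --expose-gc"` if you may need heap snapshots in step 5. |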
| 113 | +2. Reproduce the issue. |
| 114 | +3. Inspect `/debug/memory` and `/debug/db` live if you need immediate feedback. |
| 115 | +4. Run `pnpm debug:memory:analyze` against the generated logs. |
| 116 | +5. If retained heap is still unclear, capture one or more heap snapshots. |
| 117 | +6. Disable observability again when you are done. |
| 118 | + |
| 119 | +## How to read the snapshots |
| 120 | + |
| 121 | +Focus on a few high-signal sections first. |
| 122 | + |
| 123 | +- Memory and V8 |
| 124 | + - `heapUsedBytes`, `rssBytes`, and the V8 heap space breakdown tell you whether pressure is in old space, new space, or large object space |
| 125 | +- Graphile cache and builds |
| 126 | + - `graphileCache` shows how many cached handlers are live |
| 127 | + - `graphileBuilds` shows how often handlers are being rebuilt and how expensive the builds are |
| 128 | +- PostgreSQL activity |
| 129 | + - `pool.waitingCount`, `blockedActivity`, and `lockSummary` are the fastest indicators of DB contention |
| 130 | + - `activeActivity` highlights long-running queries and transaction age |
| 131 | +## What to watch for |
| 132 | + |
| 133 | +- `heapUsedMb.max` and `rssMb.max` relative to baseline |
| 134 | +- the last 6 sampler samples (about a minute at a 10000 ms interval) still trending upward in heap or RSS after load stops |
| 135 | +- repeated Graphile builds with high `averageBuildMs` or `maxBuildMs` |
| 136 | +- blocked DB sessions or `pool.waitingCount > 0` |
| 137 | +- active queries with long `xact_age` or `query_age` |
| 138 | + |
| 139 | +## Current acceptance bar |
| 140 | + |
| 141 | +- no blocked DB sessions |
| 142 | +- no PG pool waiters |
| 143 | +- heap and RSS do not grow by more than 5% across the last 6 idle samples |
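The bar above can be checked mechanically. The sketch below is one interpretation, not the analyzer's logic: it treats the 5% limit as growth from the first to the last of the final 6 idle samples, and the input shape (`IdleSample`, with `blockedSessions` and `poolWaiters` counts) is hypothetical.

```typescript
// Hypothetical per-sample summary; the real NDJSON records may be shaped differently.
interface IdleSample {
  heapUsedBytes: number;
  rssBytes: number;
  blockedSessions: number;
  poolWaiters: number;
}

const WINDOW = 6; // number of trailing idle samples to judge
const MAX_GROWTH = 0.05; // 5% allowed growth across the window

// Relative growth from the first to the last value (assumed definition of "trend").
function growth(values: number[]): number {
  const first = values[0] ?? 0;
  const last = values[values.length - 1] ?? 0;
  return first > 0 ? (last - first) / first : 0;
}

function meetsAcceptanceBar(samples: IdleSample[]): boolean {
  const tail = samples.slice(-WINDOW);
  if (tail.length < WINDOW) return false; // not enough data to judge
  const noContention = tail.every(
    (s) => s.blockedSessions === 0 && s.poolWaiters === 0,
  );
  return (
    noContention &&
    growth(tail.map((s) => s.heapUsedBytes)) <= MAX_GROWTH &&
    growth(tail.map((s) => s.rssBytes)) <= MAX_GROWTH
  );
}
```

A first-to-last comparison is simple but sensitive to a single noisy sample; a slope fit over the window would be a reasonable alternative if the samples jitter.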
| 144 | + |
| 145 | +## Operational note |
| 146 | + |
| 147 | +The observability routes and sampler are designed for engineering use on a local machine. Keep them disabled by default and do not treat them as a staging or production feature. |