Commit 1eff069

Merge pull request #798 from constructive-io/feat/observability
feat(graphql-server): add local-only observability tooling
2 parents edb69fc + 2d40307 commit 1eff069

24 files changed

Lines changed: 1860 additions & 98 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -11,3 +11,5 @@ postgres/pgsql-test/output/
 *.tsbuildinfo
 .env
 .env.local
+graphql/server/logs/
+graphql/server/*.heapsnapshot

graphql/server/README.md

Lines changed: 28 additions & 0 deletions
@@ -66,6 +66,28 @@ Runs an Express server that wires CORS, uploads, domain parsing, auth, and PostG
 - File uploads via `graphql-upload`
 - GraphiQL and health check endpoints
 - Schema cache flush via `/flush` or database notifications
+- Opt-in observability for memory, DB activity, and Graphile build debugging
+
+## Observability
+
+`@constructive-io/graphql-server` includes an opt-in observability mode for local debugging.
+
+- Master switch: `GRAPHQL_OBSERVABILITY_ENABLED=true`
+- Debug routes: `GET /debug/memory`, `GET /debug/db`
+- Background sampler: periodic NDJSON snapshots under `graphql/server/logs/`
+- CLI helpers:
+  - `pnpm debug:memory:analyze`
+  - `pnpm debug:heap:capture`
+
+Observability only activates when all of the following are true:
+
+- `GRAPHQL_OBSERVABILITY_ENABLED=true`
+- `NODE_ENV=development`
+- the server is bound to a loopback host such as `localhost`, `127.0.0.1`, or `::1`
+
+When those conditions are not met, the debug routes are not mounted and the sampler does not start. This keeps the default runtime surface minimal and prevents the observability layer from being exposed remotely.
+
+For the operational workflow, sampler output, and heap snapshot usage, see [docs/memory-debugging.md](./docs/memory-debugging.md).
 
 ## Routes
 
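
The three activation conditions in the README hunk above amount to a single boolean gate. The following is a minimal sketch of such a gate, not the code from this commit: the function name `isObservabilityActive`, the `GateInput` shape, and the exact loopback host list are assumptions for illustration, and the real implementation may accept more loopback forms.

```typescript
// Hypothetical gate: every name here is illustrative, not taken from the commit.
// Hosts treated as loopback in this sketch (the README names these three as examples).
const LOOPBACK_HOSTS = new Set<string>(["localhost", "127.0.0.1", "::1"]);

interface GateInput {
  env: Record<string, string | undefined>;
  host: string;
}

// Returns true only when all three documented conditions hold at once.
function isObservabilityActive({ env, host }: GateInput): boolean {
  const enabled = env.GRAPHQL_OBSERVABILITY_ENABLED === "true";
  const isDevelopment = env.NODE_ENV === "development";
  const isLoopback = LOOPBACK_HOSTS.has(host.trim().toLowerCase());
  return enabled && isDevelopment && isLoopback;
}
```

Keeping the check as a pure function of the environment and host would let the route mounting and the sampler startup share one decision, so neither can be enabled without the other conditions being met.
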
@@ -74,6 +96,8 @@ Runs an Express server that wires CORS, uploads, domain parsing, auth, and PostG
 - `GET /graphql` / `POST /graphql` -> GraphQL endpoint
 - `POST /graphql` (multipart) -> file uploads
 - `POST /flush` -> clears cached Graphile schema for the current API
+- `GET /debug/memory` -> memory/process/Graphile debug snapshot when observability is enabled
+- `GET /debug/db` -> PostgreSQL activity/locks/pool debug snapshot when observability is enabled
 
 ## Meta API routing
 
@@ -113,6 +137,10 @@ Configuration is merged from defaults, config files, and env vars via `@construc
 | `API_ANON_ROLE` | Anonymous role name | `administrator` |
 | `API_ROLE_NAME` | Authenticated role name | `administrator` |
 | `API_DEFAULT_DATABASE_ID` | Default database ID | `hard-coded` |
+| `GRAPHQL_OBSERVABILITY_ENABLED` | Master switch for debug routes and sampler | `false` |
+| `GRAPHQL_DEBUG_SAMPLER_ENABLED` | Enables periodic NDJSON sampling when observability is on | `true` |
+| `GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS` | Sampler interval in milliseconds | `10000` |
+| `GRAPHQL_DEBUG_SAMPLER_DIR` | Override output directory for sampler logs | `graphql/server/logs` |
 
 ## Testing
 
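
The four environment variables added to the configuration table can be read with the defaults the table lists. This is a rough sketch under stated assumptions: `readSamplerConfig` and `SamplerConfig` are hypothetical names, and the fallback to the default on a non-numeric or non-positive interval is a choice made here, not something the commit documents.

```typescript
// Hypothetical config reader; only the variable names and defaults come from the README table.
interface SamplerConfig {
  observabilityEnabled: boolean;
  samplerEnabled: boolean;
  samplerIntervalMs: number;
  samplerDir: string;
}

function readSamplerConfig(env: Record<string, string | undefined>): SamplerConfig {
  const interval = Number(env.GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS);
  return {
    // Master switch defaults to false: only the literal string "true" enables it.
    observabilityEnabled: env.GRAPHQL_OBSERVABILITY_ENABLED === "true",
    // Sampler defaults to true: only the literal string "false" disables it.
    samplerEnabled: env.GRAPHQL_DEBUG_SAMPLER_ENABLED !== "false",
    // Assumption: invalid or non-positive values fall back to the documented 10000 ms.
    samplerIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 10000,
    samplerDir: env.GRAPHQL_DEBUG_SAMPLER_DIR ?? "graphql/server/logs",
  };
}
```
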
Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+# Observability and Memory Debugging Runbook
+
+Use this runbook when reproducing Graphile/server memory growth locally or in staging, or when you need a quick operational view of Graphile build churn and PostgreSQL activity.
+
+## What observability enables
+
+Set `GRAPHQL_OBSERVABILITY_ENABLED=true` to turn on the graphql/server observability surfaces.
+
+When enabled:
+
+- `GET /debug/memory` returns a process snapshot with Node/V8 memory, Graphile cache state, in-flight builds, and Graphile build timings
+- `GET /debug/db` returns PostgreSQL pool usage, active queries, blocked sessions, lock summary, and selected database stats
+- the debug sampler writes periodic NDJSON snapshots for offline analysis
+
+Observability only activates when all of the following are true:
+
+- `GRAPHQL_OBSERVABILITY_ENABLED=true`
+- `NODE_ENV=development`
+- the GraphQL server is bound to a loopback host such as `localhost`, `127.0.0.1`, or `::1`
+
+When those conditions are not met:
+
+- `/debug/memory` returns `404`
+- `/debug/db` returns `404`
+- the sampler does not start
+- no debug log directory is created
+
+This mode is intended for local debugging only. It is not a live operational surface.
+
+## Start the server with profiling enabled
+
+```bash
+cd graphql/server
+NODE_ENV=development \
+GRAPHQL_OBSERVABILITY_ENABLED=true \
+NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2 --expose-gc" \
+GRAPHQL_DEBUG_SAMPLER_ENABLED=true \
+GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS=10000 \
+pnpm dev
+```
+
+`GRAPHQL_OBSERVABILITY_ENABLED` is the master switch. When it is not `true`, the server does not mount `/debug/memory`, does not mount `/debug/db`, and does not start the sampler.
+`NODE_ENV` must also be `development`, and the server host must be loopback-only.
+
+Optional knobs:
+
+- `GRAPHQL_DEBUG_SAMPLER_ENABLED=false` disables the sampler while leaving the debug routes available
+- `GRAPHQL_DEBUG_SAMPLER_INTERVAL_MS=<ms>` changes the sampling interval
+- `GRAPHQL_DEBUG_SAMPLER_DIR=/abs/path` writes the sampler output somewhere other than `graphql/server/logs`
+
+The debug sampler writes one run directory per server process under `graphql/server/logs/` by default.
+
+Expected files per sampler session:
+
+- `debug-memory.ndjson`
+- `debug-db.ndjson`
+- `debug-sampler-errors.ndjson`
+
+## Debug routes
+
+Use these routes for live inspection while a server is running:
+
+- `GET /debug/memory`
+  - process memory usage
+  - V8 heap statistics and heap spaces
+  - Graphile cache state
+  - in-flight handler creation count
+  - Graphile build timing aggregates
+  - PostGIS codec telemetry
+- `GET /debug/db`
+  - PG pool totals, idle count, and waiters
+  - active queries and blocked sessions
+  - lock summary
+  - selected `pg_stat_database` counters
+  - `pg_notification_queue_usage()`
+
+## Analyze the latest sampler run
+
+```bash
+cd graphql/server
+pnpm debug:memory:analyze
+```
+
+To analyze a specific run directory:
+
+```bash
+cd graphql/server
+pnpm debug:memory:analyze -- --dir ./logs/run-2026-03-09T12-00-00-000Z-pid12345
+```
+
+## Capture a heap snapshot
+
+```bash
+cd graphql/server
+pnpm debug:heap:capture -- --pid <server-pid>
+```
+
+If your server writes snapshots somewhere else, pass `--dir`.
+
+## Tooling reference
+
+- `pnpm debug:memory:analyze`
+  - reads the latest sampler directory by default
+  - summarizes heap/RSS range, Graphile build stats, DB waiters, and blocked sessions
+- `pnpm debug:heap:capture -- --pid <server-pid>`
+  - sends `SIGUSR2` to the server process
+  - requires `NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2 --expose-gc"`
+  - prints the created `.heapsnapshot` path
+
+## Recommended incident workflow
+
+1. Start the server with `GRAPHQL_OBSERVABILITY_ENABLED=true` and `NODE_ENV=development` on a loopback host.
+2. Reproduce the issue.
+3. Inspect `/debug/memory` and `/debug/db` live if you need immediate feedback.
+4. Run `pnpm debug:memory:analyze` against the generated logs.
+5. If retained heap is still unclear, capture one or more heap snapshots.
+6. Disable observability again when you are done.
+
+## How to read the snapshots
+
+Focus on a few high-signal sections first.
+
+- Memory and V8
+  - `heapUsedBytes`, `rssBytes`, and the V8 heap space breakdown tell you whether pressure is in old space, new space, or large object space
+- Graphile cache and builds
+  - `graphileCache` shows how many cached handlers are live
+  - `graphileBuilds` shows how often handlers are being rebuilt and how expensive the builds are
+- PostgreSQL activity
+  - `pool.waitingCount`, `blockedActivity`, and `lockSummary` are the fastest indicators of DB contention
+  - `activeActivity` highlights long-running queries and transaction age
+## What to watch for
+
+- `heapUsedMb.max` and `rssMb.max` relative to baseline
+- last 6 sampler samples still trending upward in heap or RSS after load stops
+- repeated Graphile builds with high `averageBuildMs` or `maxBuildMs`
+- blocked DB sessions or `pool.waitingCount > 0`
+- active queries with long `xact_age` or `query_age`
+
+## Current acceptance bar
+
+- no blocked DB sessions
+- no PG pool waiters
+- last 6 idle samples do not trend upward by more than 5% for heap or RSS
+
+## Operational note
+
+The observability routes and sampler are designed for engineering use on a local machine. Keep them disabled by default and do not treat them as a staging or production feature.
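
The runbook's acceptance bar (the last 6 idle samples must not trend upward by more than 5% for heap or RSS) can be checked mechanically against `debug-memory.ndjson`. The sketch below is one possible reading, not the logic of `scripts/analyze-debug-logs.mjs`: it assumes each NDJSON line carries top-level `heapUsedBytes` and `rssBytes` fields, and it measures the trend as relative growth from the first to the last sample in the window. The function names are hypothetical.

```typescript
// Hypothetical sample shape: the field names appear in the runbook, but their
// position in the real NDJSON records is an assumption.
interface MemorySample {
  heapUsedBytes: number;
  rssBytes: number;
}

// Parses NDJSON text (one JSON object per line), skipping blank lines.
function parseNdjson(text: string): MemorySample[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as MemorySample);
}

// Relative growth from the first to the last value of the trailing window.
// Returns 0 when there are fewer than two values or the baseline is 0.
function growthOverWindow(values: number[], windowSize = 6): number {
  const tail = values.slice(-windowSize);
  if (tail.length < 2 || tail[0] === 0) return 0;
  return (tail[tail.length - 1] - tail[0]) / tail[0];
}

// True when neither heap nor RSS grew by more than maxGrowth over the window.
function meetsIdleBar(samples: MemorySample[], maxGrowth = 0.05): boolean {
  const heapGrowth = growthOverWindow(samples.map((s) => s.heapUsedBytes));
  const rssGrowth = growthOverWindow(samples.map((s) => s.rssBytes));
  return heapGrowth <= maxGrowth && rssGrowth <= maxGrowth;
}
```

A first-to-last comparison is sensitive to a single noisy sample at either end of the window; a slope fit over the 6 points would be steadier, at the cost of a little more code.
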

graphql/server/package.json

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@
 "build:dev": "makage build --dev",
 "dev": "ts-node src/run.ts",
 "dev:watch": "nodemon --watch src --ext ts --exec ts-node src/run.ts",
+"debug:memory:analyze": "node scripts/analyze-debug-logs.mjs",
+"debug:heap:capture": "node scripts/capture-heap-snapshot.mjs",
 "lint": "eslint . --fix",
 "test": "jest --passWithNoTests",
 "test:watch": "jest --watch",
