We already have some good documentation on setting up server timings (https://github.com/epicweb-dev/epic-stack/blob/main/docs/server-timing.md), but it can only give me information about requests that I make manually using the devtools. For production applications, we need some way to monitor whether everything is working smoothly for end users and to identify potential bottlenecks.
Some info I think is essential to have:
- HTTP response time across all routes (mean, P90, P99)
- Overall HTTP status codes across all routes
- HTTP response time per route (mean, P90, P99)
- HTTP status code per route
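To make the per-route numbers concrete, here is a minimal sketch of the kind of aggregation I have in mind. Everything in it (`recordRequest`, `summarize`, the in-memory store) is hypothetical and not part of the Epic Stack; it assumes nearest-rank percentiles and keeps raw samples in memory, which a real setup would probably replace with Prometheus histograms.

```typescript
// Hypothetical in-memory recorder; all names are illustrative only.
type RouteStats = {
  durationsMs: number[]
  statusCounts: Record<number, number>
}

const statsByRoute: Record<string, RouteStats> = {}

// Call once per finished request, e.g. from a server middleware.
function recordRequest(route: string, status: number, durationMs: number): void {
  const stats =
    statsByRoute[route] ?? (statsByRoute[route] = { durationsMs: [], statusCounts: {} })
  stats.durationsMs.push(durationMs)
  stats.statusCounts[status] = (stats.statusCounts[status] ?? 0) + 1
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN
  const sorted = values.slice().sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  return sorted[rank - 1]
}

// Mean, P90, P99 and status code counts for one route, or null if nothing was recorded.
function summarize(route: string) {
  const stats = statsByRoute[route]
  if (!stats) return null
  const total = stats.durationsMs.reduce((sum, ms) => sum + ms, 0)
  return {
    meanMs: total / stats.durationsMs.length,
    p90Ms: percentile(stats.durationsMs, 90),
    p99Ms: percentile(stats.durationsMs, 99),
    statusCounts: stats.statusCounts,
  }
}
```

The "across all routes" numbers would be the same calculation over the combined samples. Using the route pattern (e.g. `/users/:id`) rather than the raw URL as the key keeps the number of distinct routes bounded.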
It would also be useful to have:
- Show most frequent SQL queries
- Show SQL queries with longest execution time
- Ability to define custom metrics (e.g. recording the "server timing" data mentioned previously somewhere that can be turned into a dashboard)
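As a rough illustration of the query and custom metrics ideas, something like the sketch below could wrap any async call and track how often it runs and how long it takes. The names (`recordQuery`, `timeQuery`, `mostFrequent`, `slowest`) are made up for this example, and it assumes queries are identified by a label the caller picks rather than by the SQL text itself.

```typescript
// Hypothetical per-query statistics; all names are illustrative only.
type QueryStats = { name: string; count: number; totalMs: number; maxMs: number }

const queryStats: Record<string, QueryStats> = {}

// Record one execution of a named query (or any other custom timing).
function recordQuery(name: string, durationMs: number): void {
  const stats =
    queryStats[name] ?? (queryStats[name] = { name, count: 0, totalMs: 0, maxMs: 0 })
  stats.count += 1
  stats.totalMs += durationMs
  stats.maxMs = Math.max(stats.maxMs, durationMs)
}

// Time an async operation and record it, whether it resolves or throws.
async function timeQuery<T>(name: string, run: () => Promise<T>): Promise<T> {
  const start = Date.now()
  try {
    return await run()
  } finally {
    recordQuery(name, Date.now() - start)
  }
}

// Queries ordered by how often they ran, most frequent first.
function mostFrequent(limit: number): QueryStats[] {
  return Object.keys(queryStats)
    .map(name => queryStats[name])
    .sort((a, b) => b.count - a.count)
    .slice(0, limit)
}

// Queries ordered by their slowest single execution, slowest first.
function slowest(limit: number): QueryStats[] {
  return Object.keys(queryStats)
    .map(name => queryStats[name])
    .sort((a, b) => b.maxMs - a.maxMs)
    .slice(0, limit)
}
```

Usage would look like `await timeQuery('get user by id', () => loadUser(id))`, where `loadUser` stands in for whatever database call is being measured. The same `recordQuery` function could also be fed the existing "server timing" values.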
Fly.io provides a Prometheus instance + Grafana dashboard (in preview at the moment):
https://fly.io/docs/reference/metrics/#managed-grafana-preview
This includes some basic information covering average HTTP response times + status codes. However, it doesn't yet break them down by route (which would be useful for working out what's causing issues). There is a way to expose data to Prometheus via an endpoint, so perhaps that's how extra metrics could be added?
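For reference, the body of such an endpoint is plain text in the Prometheus exposition format, so a per-route counter could be rendered roughly like this. It is only a sketch: `renderCounter` and the metric name `http_requests_total` are my own choices for the example, and I haven't verified how Fly needs the endpoint to be configured.

```typescript
// Hypothetical helper that renders one counter in the Prometheus text format.
type Labels = Record<string, string>
type Sample = { labels: Labels; value: number }

// Label values must escape backslashes, double quotes and newlines.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n')
}

// Produces {key="value",...} with keys sorted for stable output, or '' when empty.
function formatLabels(labels: Labels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map(key => `${key}="${escapeLabelValue(labels[key])}"`)
  return pairs.length > 0 ? `{${pairs.join(',')}}` : ''
}

// One HELP line, one TYPE line, then one line per labelled sample.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`]
  for (const sample of samples) {
    lines.push(`${name}${formatLabels(sample.labels)} ${sample.value}`)
  }
  return lines.join('\n') + '\n'
}
```

A route such as `/metrics` could then return this string with the content type `text/plain; version=0.0.4`. In practice an existing Prometheus client library would likely be a better choice than hand-rolling the format; the point is just that `route` and `status` labels are what would give the per-route breakdown.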
It would be great to add some documentation on 1) Grafana and 2) how we could add some metrics not included by default (namely HTTP response time/status by route, frequency/duration of SQL queries, and custom metrics).