|
| 1 | +# Prometheus Metrics From Queries |
| 2 | + |
| 3 | +## The Problem |
| 4 | + |
| 5 | +Users want to monitor their Materialize workloads and data products. |
| 6 | +Setting up external tools to convert SQL queries into Prometheus metrics is labor-intensive and error-prone, and the tools themselves are often buggy. |
| 7 | + |
| 8 | +## Success Criteria |
| 9 | + |
| 10 | +- Users can define SQL queries that are exposed as Prometheus metrics. |
| 11 | +- Users can group these metrics into HTTP endpoints, so they may have separate scrape configs (for different auth requirements and/or scrape frequency). |
| 12 | + |
| 13 | +## Out of Scope |
| 14 | + |
| 15 | +- Generic HTTP endpoint creation for formats other than Prometheus. |
| 16 | + |
| 17 | + While the proposed solution could be extended to other API formats, that is not required to support Prometheus. |
| 18 | + |
| 19 | +- Removal of the Materialize Cloud promsql exporter. |
| 20 | + |
| 21 | + The promsql exporter relies on internal tables over which views cannot be created, which would complicate this proposal. |
| 22 | + Instead, we should move those queries into normal metrics endpoints, which would also give customers access to these metrics. |
| 23 | + |
| 24 | + There are two open tickets related to this: |
| 25 | + - https://github.com/MaterializeInc/database-issues/issues/10028 |
| 26 | + - https://github.com/MaterializeInc/database-issues/issues/10030 |
| 27 | + |
| 28 | +## Solution Proposal |
| 29 | + |
| 30 | +Allow users to create HTTP endpoints in SQL that serve custom Prometheus metrics. |
| 31 | + |
| 32 | +```sql |
| 33 | +CREATE API mydatabase.myschema.myprometheus FORMAT PROMETHEUS IN CLUSTER "mycluster"; |
| 34 | +``` |
| 35 | +This will create an HTTP endpoint at `/api/metrics/custom/mydatabase/myschema/myprometheus` on all HTTP listeners with the `endpoint_api` enabled in the listeners configmap. |
| 36 | + |
| 37 | +This new API object would be added to a system table, `mz_apis`, for later reference: |
| 38 | +``` |
| 39 | +id TEXT, |
| 40 | +oid OID, |
| 41 | +schema_id TEXT, |
| 42 | +name TEXT, |
| 43 | +cluster_id TEXT, |
| 44 | +owner_id TEXT, |
| 45 | +privileges mz_aclitem[] |
| 46 | +``` |
| 47 | + |
| 48 | +The `cluster_id` references the cluster used to peek the metric source relations (corresponds to `mz_clusters.id`). |
| 49 | + |
| 50 | +Users can then add metrics to that endpoint using SQL commands: |
| 51 | +```sql |
| 52 | +CREATE METRIC <name> |
| 53 | +IN API <api> |
| 54 | +AS (TYPE <prometheus_type>, |
| 55 | + HELP <help_text>, |
| 56 | + SERIES FROM <reference_to_view>, |
| 57 | + VALUE COLUMN <name_of_value_column>); |
| 58 | +``` |
| 59 | + |
| 60 | +This will add a new metric object to a system table `mz_metrics` for later reference: |
| 61 | +``` |
| 62 | +id TEXT, |
| 63 | +oid OID, |
| 64 | +schema_id TEXT, |
| 65 | +name TEXT, |
| 66 | +api_id TEXT, |
| 67 | +type TEXT, |
| 68 | +help TEXT, |
| 69 | +series_from TEXT, |
| 70 | +value_column TEXT, |
| 71 | +owner_id TEXT |
| 72 | +``` |
| 73 | + |
| 74 | +The `name`, `type`, and `help` fields describe the Prometheus metric itself. `api_id` references the `mz_apis` entry the metric is attached to. |
| 75 | + |
| 76 | +The `series_from` field is the ID of the relation containing the metric data (corresponds to `mz_catalog.mz_relations.id`). The `value_column` is the name of a column in that relation which contains the value of the metric. All other columns in the relation will be used as labels. |
| 77 | + |
| 78 | +An example metric view: |
| 79 | +```sql |
| 80 | + CREATE VIEW converted_leads |
| 81 | +AS |
| 82 | + (SELECT Count(*), |
| 83 | + converted |
| 84 | + FROM (SELECT id, |
| 85 | + CASE |
| 86 | + WHEN converted_at IS NULL THEN 'FALSE' |
| 87 | + ELSE 'TRUE' |
| 88 | + END AS converted |
| 89 | + FROM leads) |
| 90 | + GROUP BY converted); |
| 91 | +``` |
| 92 | + |
| 93 | +The view's contents might look like: |
| 94 | +| count | converted | |
| 95 | +|-------|-----------| |
| 96 | +|22|TRUE| |
| 97 | +|67|FALSE| |
| 98 | + |
| 99 | +The user can then add this metric to their registry: |
| 100 | +```sql |
| 101 | +CREATE METRIC leads |
| 102 | +IN API mydatabase.myschema.myprometheus |
| 103 | +AS (TYPE 'gauge', |
| 104 | + HELP 'Count of leads and whether they have been converted', |
| 105 | + SERIES FROM mydatabase.myschema.converted_leads, |
| 106 | + VALUE COLUMN 'count'); |
| 107 | +``` |
| 108 | + |
| 109 | +When querying the HTTP endpoint at `/api/metrics/custom/mydatabase/myschema/myprometheus`, they would then get a response like: |
| 110 | +``` |
| 111 | +# HELP mz_custom_leads Count of leads and whether they have been converted |
| 112 | +# TYPE mz_custom_leads gauge |
| 113 | +mz_custom_leads{converted="TRUE"} 22 |
| 114 | +mz_custom_leads{converted="FALSE"} 67 |
| 115 | +``` |
| 116 | + |
| 117 | +All exposed metric names are prefixed with `mz_custom_` to namespace user-defined metrics and avoid collisions with Materialize's built-in metrics. The prefix is injected at exposition time; the user-supplied metric name (e.g. `leads`) is what appears in `CREATE METRIC` and the `mz_metrics` catalog. |
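To make the mapping from relation rows to exposition text concrete, here is a minimal Python sketch. It is illustrative only and not the proposed implementation: `render_metric` and its helpers are hypothetical names, rows are assumed to arrive as dictionaries keyed by column name, and the escaping follows the Prometheus text exposition format.

```python
def escape_label_value(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def escape_help(text: str) -> str:
    # HELP text escapes backslash and newline only.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def render_metric(name, metric_type, help_text, value_column, rows):
    """Render one metric (hypothetical helper, not a Materialize API).

    `rows` is a list of dicts keyed by column name. The value column supplies
    the sample value; every other column becomes a label.
    """
    full_name = f"mz_custom_{name}"  # prefix injected at exposition time
    lines = [
        f"# HELP {full_name} {escape_help(help_text)}",
        f"# TYPE {full_name} {metric_type}",
    ]
    for row in rows:
        labels = ",".join(
            f'{column}="{escape_label_value(str(value))}"'
            for column, value in row.items()
            if column != value_column
        )
        sample = f"{full_name}{{{labels}}}" if labels else full_name
        lines.append(f"{sample} {row[value_column]}")
    return "\n".join(lines) + "\n"


leads_rows = [
    {"count": 22, "converted": "TRUE"},
    {"count": 67, "converted": "FALSE"},
]
leads_text = render_metric(
    "leads",
    "gauge",
    "Count of leads and whether they have been converted",
    "count",
    leads_rows,
)
```

With the `converted_leads` rows from the example above, `leads_text` matches the sample response shown for the endpoint.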
| 118 | + |
| 119 | +## RBAC |
| 120 | + |
| 121 | +Scrapes do **not** run as the API owner. Each request to `/api/metrics/custom/...` runs as the role that authenticated the HTTP request, exactly as if that role had issued the underlying `SELECT`s itself. On listeners with `authenticator_kind = "None"`, the role is taken from the basic-auth userinfo (the password is not checked); if no username is supplied, the request falls back to the built-in `anonymous_http_user` role. |
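The role resolution described for `authenticator_kind = "None"` listeners could be sketched as follows. This is an assumption-laden illustration in Python, not Materialize code: `role_from_authorization` is a hypothetical helper, and treating a malformed or non-basic `Authorization` header the same as a missing username is a choice made here that the design does not specify.

```python
import base64

ANONYMOUS_ROLE = "anonymous_http_user"


def role_from_authorization(header):
    """Pick the scraping role from an HTTP Authorization header value.

    Hypothetical helper: the password is ignored, and any header that does
    not yield a username falls back to the anonymous role (an assumption).
    """
    if not header:
        return ANONYMOUS_ROLE
    scheme, _, payload = header.partition(" ")
    if scheme.lower() != "basic" or not payload.strip():
        return ANONYMOUS_ROLE
    try:
        decoded = base64.b64decode(payload.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers invalid base64 (binascii.Error) and invalid UTF-8.
        return ANONYMOUS_ROLE
    username, _, _password = decoded.partition(":")  # password is not checked
    return username or ANONYMOUS_ROLE
```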
| 122 | + |
| 123 | +For a scrape to succeed, the scraping role must hold: |
| 124 | + |
| 125 | +- `USAGE` on the API object, |
| 126 | +- `USAGE` on the API's cluster (the cluster used to peek the metric relations), and |
| 127 | +- `SELECT` on every relation referenced by a metric's `SERIES FROM` (and the `USAGE` on the containing database/schema that `SELECT` already requires). |
| 128 | + |
| 129 | +When a permission is missing, the endpoint fails the whole scrape rather than silently omitting metrics (a partial exposition would otherwise look like a healthy target reporting zero): |
| 130 | + |
| 131 | +- Missing `USAGE` on the API → `404 Not Found`. The API is resolved from the catalog and gated before any query runs; returning `404` rather than `403` means the role cannot distinguish "API exists but you can't see it" from "no such API", matching how Materialize hides objects a role has no access to. |
| 132 | +- Missing `USAGE` on the cluster, or missing `SELECT` on any metric's `SERIES FROM` relation → `403 Forbidden`. Both surface from executing the scrape query as the scraping role (SQLSTATE `42501`, insufficient privilege), so they are not distinguished from each other. |
| 133 | + |
| 134 | +The API-`USAGE` check reuses the same RBAC gating predicate as `CREATE`-time validation (`is_rbac_enforced_for_session`), so scrape-time checks cannot drift from the rules applied when the object was defined, and the cluster/relation checks fall out of running the query itself. This also means the usual escape hatches apply: superusers, system roles, and environments with RBAC disabled bypass these checks. |
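The status-code rules above can be summarized in a small decision function. The Python sketch below models only the outcome: in the proposal, the cluster and relation checks fall out of executing the query rather than being evaluated up front, and `ScrapeRole` and `scrape_status` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeRole:
    """Privileges held by the scraping role (hypothetical model)."""
    api_usage: bool = False
    cluster_usage: bool = False
    selectable: set = field(default_factory=set)
    # True for superusers, system roles, or when RBAC is disabled.
    bypasses_rbac: bool = False


def scrape_status(role, series_relations):
    """Return the HTTP status a scrape would produce for `role`."""
    if role.bypasses_rbac:
        return 200
    if not role.api_usage:
        # Indistinguishable from "no such API".
        return 404
    if not role.cluster_usage:
        return 403
    if any(relation not in role.selectable for relation in series_relations):
        # One unreadable relation fails the whole scrape.
        return 403
    return 200


relations = ["mydatabase.myschema.converted_leads"]
full_access = ScrapeRole(api_usage=True, cluster_usage=True, selectable=set(relations))
no_api = ScrapeRole(cluster_usage=True, selectable=set(relations))
no_select = ScrapeRole(api_usage=True, cluster_usage=True)
```

Here `scrape_status(full_access, relations)` yields `200`, `no_api` yields `404`, and `no_select` yields `403`.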
| 135 | + |
| 136 | +## Minimal Viable Prototype |
| 137 | + |
| 138 | +- [Hackathon presentation from May 2025](https://docs.google.com/presentation/d/1ek0tOlECHfpoBp_-vtcDWhN4YHpaWBfRuQyENGFbWLw/edit?slide=id.g35c518b4039_14_3503#slide=id.g35c518b4039_14_3503) |
| 139 | +- [Hackathon code from May 2025](https://github.com/MaterializeInc/materialize/compare/main...alex-hunt-materialize:materialize:external_api) |
| 140 | +- [Hackathon brainstorming from May 2025](https://www.notion.so/materialize/Hackathon-Alex-Justin-1f913f48d37b805e88b0e25a8ad1a763) |
| 141 | + |
| 142 | +While the prototype does not use the exact interface described here, it captures the idea proposed here. |
| 143 | + |
| 144 | +## Alternatives |
| 145 | + |
| 146 | +The main alternative is to keep relying on external SQL exporters. |
| 147 | + |
| 148 | +We currently run our own exporter in Materialize Cloud, which we wrote after hitting numerous problems with third-party ones. We still recommend third-party solutions to our customers, which is not ideal. |
| 149 | + |
| 150 | +## Open questions |
| 151 | + |
| 152 | +- Exact syntax and SQL object types. We might want to have dedicated SQL syntax for creating metrics, or have some dedicated reference to the views rather than text fields, for example. |
| 153 | +- Should we require indexed or materialized views? |