This RFC is to introduce a new metrics source to consume metrics from PostgreSQL database servers. The high level plan is to implement one source that collects metrics from PostgreSQL server instances.
Background reading on PostgreSQL monitoring:
This RFC will cover:
- A new source for PostgreSQL server metrics.
This RFC will not cover:
- Other databases.
Users want to collect, transform, and forward metrics to better observe how their PostgreSQL databases are performing.
Build a single source called postgresql_metrics (name to be confirmed) to collect PostgreSQL metrics. We support all non-EOL'ed PostgreSQL versions.
The recommended implementation is to use the Rust PostgreSQL client to connect the target database server by address specified in configuration.
The source would then run the following queries:
SELECT * FROM pg_stat_databaseSELECT * FROM pg_stat_database_conflictsSELECT * FROM pg_stat_bgwriter
And return these metrics by parsing the query results and converting them into metrics using the database name and column names.
pg_up-> Used as an uptime metric with 0 for successful collection and 1 for failed collection (gauge)pg_stat_database_blk_read_time_seconds_totaltagged with db, host, server, user (counter)pg_stat_database_blk_write_time_seconds_totaltagged with db, host, server, user (counter)pg_stat_database_blks_hit_totaltagged with db, host, server, user (counter)pg_stat_database_blks_read_totaltagged with db, host, server, user (counter)pg_stat_database_stats_resettagged with host, server, user (gauge)pg_stat_bgwriter_buffers_alloc_totaltagged with host, server (counter)pg_stat_bgwriter_buffers_backend_totaltagged with host, server (counter)pg_stat_bgwriter_buffers_backend_fsync_totaltagged with host, server (counter)pg_stat_bgwriter_buffers_checkpoint_totaltagged with host, server (counter)pg_stat_bgwriter_buffers_clean_totaltagged with host, server (counter)pg_stat_bgwriter_checkpoint_sync_time_seconds_totaltagged with host, server (counter)pg_stat_bgwriter_checkpoint_write_time_seconds_totaltagged with host, server (counter)pg_stat_bgwriter_checkpoints_req_totaltagged with host, server (counter)pg_stat_bgwriter_checkpoints_timed_totaltagged with host, server (counter)pg_stat_bgwriter_maxwritten_clean_totaltagged with host, server (counter)pg_stat_bgwriter_stats_resethost, server (gauge)pg_stat_database_conflicts_totaltagged with db, host, server, user (counter)pg_stat_database_datid(database ID) tagged with db, host, server, user (gauge)pg_stat_database_deadlocks_totaltagged with db, host, server, user (counter)pg_stat_database_numbackends_totaltagged with db, host, server, user (gauge)pg_stat_database_temp_bytes_totaltagged with db, host, server, user (counter)pg_stat_database_temp_files_totaltagged with db, host, server, user (counter)pg_stat_database_tup_deleted_totaltagged with db, host, server, user (counter)pg_stat_database_tup_fetched_totaltagged with db, host, server, user (counter)pg_stat_database_tup_inserted_totaltagged with db, host, server, user (counter)pg_stat_database_tup_returned_totaltagged with db, host, server, user (counter)pg_stat_database_tup_updated_totaltagged with db, host, server, user (counter)pg_stat_database_xact_commit_totaltagged with db, host, server, user (counter)pg_stat_database_xact_rollback_totaltagged with db, host, server, user (counter)pg_database_conflicts_totaltagged by db name and server (counter)pg_database_conflicts_confl_bufferpin_totaltagged by db name and server (counter)pg_database_conflicts_confl_deadlock_totaltagged by db name and server (counter)pg_database_conflicts_confl_lock_total taggedby db name and server (counter)pg_database_conflicts_confl_snapshot_totaltagged by db name and server (counter)pg_database_conflicts_confl_tablespace_totaltagged by db name and server (counter)
Naming of metrics is determined via:
table_name _ column_name + Prometheus endings (_total for counters, etc)
For example:
pg_database_conflicts_confl_tablespace_total
Here pg_database_conflicts is the name of the table, confl_tablespace is the column name, and _total is suffixed because counters end in total in the Prometheus naming convention.
This is in line with the Prometheus naming convention for their exporter.
All metrics will also be tagged with the endpoint (stripped of username/password).
The following additional source configuration will be added:
[sources.my_source_id]
type = "postgresql_metrics" # required
endpoint = "postgres://postgres@localhost" # required - address of the PG server.
included_databases = ["production", "testing"] # optional, list of databases to query. Defaults to all if not specified.
excluded_databases = [ "development" ] # optional, excludes specific databases. If a DB is excluded explicitly but included in `included_databases` then it is excluded.
scrape_interval_secs = 15 # optional, default, seconds
namespace = "postgresql" # optional, default is "postgresql", namespace to attach to metrics.We will also expose the HTTP SSL settings and support ssl in the endpoint URL.
- We'd also add a guide for doing this without root permissions.
PostgreSQL is a commonly adopted modern database. Users frequently want to monitor its state and performance.
Additionally, as part of Vector's vision to be the "one tool" for ingesting and shipping observability data, it makes sense to add as many sources as possible to reduce the likelihood that a user will not be able to ingest metrics from their tools.
- https://github.com/wrouesnel/postgres_exporter/
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/postgresql
- https://collectd.org/wiki/index.php/Plugin:PostgreSQL
- https://collectd.org/documentation/manpages/collectd.conf.5.shtml#plugin_postgresql
- Additional maintenance and integration testing burden of a new source
We could not add the source directly to Vector and instead instruct users to run Telegraf's postgresql input or Prometheus' postgresql_exporter and point Vector at the resulting data. This would leverage the already supported inputs from those projects.
I decided against this as it would be in contrast with one of the listed principles of Vector:
One Tool. All Data. - One simple tool gets your logs, metrics, and traces (coming soon) from A to B.
On the same page, it is mentioned that Vector should be a replacement for Telegraf.
You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or similar tools.
If users are already running Telegraf or PostgreSQL Exporter though, they could opt for this path.
- Grab pg_settings - should look at this during implementation.
Incremental steps that execute this change. Generally this is in the form of:
- Submit a PR with the initial source implementation
- Extend source to collect additional database metrics:
- Replication
- Locks
- pg_stat_user_tables