| 1 | +# Bench methodology — the definitive reference |
| 2 | + |
| 3 | +Adapted from GitLab issue [postgres-ai/postgresql-consulting/tests-and-benchmarks#77, note 3263767264](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/77#note_3263767264). |
| 4 | + |
| 5 | +This document is the single source of truth for how the PostgreSQL-queue bench is structured, what it measures, and every script/config that makes it work. It is written so a reviewer can reproduce the whole thing end-to-end. |
| 6 | + |
| 7 | +Cross-links: |
| 8 | + |
| 9 | +- Upstream fix: [pgque PR #62](https://github.com/NikolayS/pgque/pull/62) · issue [NikolayS/pgque#61](https://github.com/NikolayS/pgque/issues/61) |
| 10 | +- GitLab rounds: R4 final narrative — note 3262968456; R5 root — note 3263000815; R5 full writeup — note 3263438287; pgque rotation PoC under held xmin — note 3263417609. |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## 1. Goals and pathology |
| 15 | + |
| 16 | +We reproduce Brandur Leach's "Postgres queues" death spiral (see [Brandur's 2015 post](https://brandur.org/postgres-queues)): a DELETE-based work queue bloats beyond what VACUUM can reclaim as soon as an unrelated backend holds the `xmin` horizon back (long-running transactions, logical replication slots, stuck standby feedback). Dead tuples accumulate, bitmap and index scans must traverse them, latency explodes, and throughput stays collapsed even after the holder releases.
| 17 | + |
| 18 | +Seven systems, each on its own VM: |
| 19 | + |
| 20 | +| System | Version | Pattern | |
| 21 | +|---|---|---| |
| 22 | +| **pgque** | patched via PR #62 | batch ticker + rotating event tables (TRUNCATE) | |
| 23 | +| pgq | v3.5.1 PL-only | same model as pgque; upstream baseline | |
| 24 | +| pgmq | v1.11.0 | single queue table, visibility timeout + DELETE |
| 25 | +| pgmq-partitioned | v1.11.0 + pg_partman | range-partitioned queue table | |
| 26 | +| river | v0.34 | SQL-level SKIP LOCKED + DELETE (consumer emulated) | |
| 27 | +| que | v2.4 | SKIP LOCKED + DELETE (consumer emulated) | |
| 28 | +| pg-boss | v12.15 | SKIP LOCKED + DELETE on partitioned `pgboss.job` | |
| 29 | + |
| 30 | +The Go/Ruby/Node workers are installed end-to-end so the schema is authentic, but the actual consumer load is driven via pgbench running each system's *SQL claim pattern*. This isolates the DB-side behaviour from runtime/GC artefacts. |
| 31 | + |
| 32 | +Workload shape (all runs): |
| 33 | + |
| 34 | +- Producer: 1 client, pgbench `-R 1000` rate-cap (may move to `-R 5000` in R7/R8) |
| 35 | +- Consumer: 1 client for pgque/pgq, 4 clients for everything else |
| 36 | +- Three 30-minute phases back-to-back = **1.5 h per run**: |
| 37 | + 1. **Clean baseline** — no held xmin |
| 38 | + 2. **Held xmin** — `idle_in_tx.py` holds a `REPEATABLE READ` transaction open
| 39 | + 3. **Clean recovery** — holder killed, observe regrowth / catchup |
| 40 | + |
| 41 | +--- |
| 42 | + |
| 43 | +## 2. Infrastructure |
| 44 | + |
| 45 | +- AWS us-east-2, **i4i.2xlarge** (8 vCPU, 64 GB, NVMe instance store) |
| 46 | +- Spot where available; on-demand only as last resort (see Section 10) |
| 47 | +- Ubuntu 24.04, **PG18 from PGDG**, `pg_cron`, `pg_stat_statements`, pg_ash, pgfr |
| 48 | +- Data dir moved to NVMe: `/mnt/pgdata/postgresql/18/main` symlink (see [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) and [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md)) |
| 49 | +- One VM per system so any tuning / runtime / GC behaviour is contained |
| 50 | +- SSH key: `<your-ssh-key>` (us-east-2) |
| 51 | + |
| 52 | +Live R7 VMs (redacted): `<pgque-ip>`, `<pgq-ip>`, `<pgmq-ip>`, `<river-ip>`, `<pgboss-ip>`, `<que-ip>`, `<pgmq-partitioned-ip>`. |
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## 3. Observability stack |
| 57 | + |
| 58 | +Seven parallel streams, (a) through (g), run during every bench; (h) and (i) are helper processes rather than metric streams. All CSVs land in `/tmp/bench/` and are rsynced to the local `/tmp/bench_r<N>/<system>/` tree every 30 min.
| 59 | + |
| 60 | +### (a) bloat_sampler.py — pg_stat_user_tables every 30 s |
| 61 | + |
| 62 | +Per-system filter covers each queue's event/metadata tables (including pg-boss partitions and pgmq-partitioned's partman partitions). Writes `bloat.csv`. |
| 63 | + |
| 64 | +Source: [tooling/bloat_sampler.py](tooling/bloat_sampler.py). |
| 65 | + |
| 66 | +### (b) pg_ash — 1 Hz wait-event sampling |
| 67 | + |
| 68 | +pg_cron runs ash sampling at 1 Hz throughout; at end of bench we export: |
| 69 | + |
| 70 | +```sql |
| 71 | +COPY (SELECT sample_time, database_name, active_backends, wait_event, query_id |
| 72 | + FROM ash.samples(p_interval => '2 hour'::interval, p_limit => 2000000)) |
| 73 | +TO '/tmp/bench/ash.csv' CSV HEADER; |
| 74 | +``` |
| 75 | + |
| 76 | +### (c) pgfr (pg-flight-recorder) — snapshot observability |
| 77 | + |
| 78 | +pgfr writes snapshots on a pg_cron schedule into `pgfr_record.*`. At end of run we export: |
| 79 | + |
| 80 | +```sql |
| 81 | +COPY (SELECT * FROM pgfr_record.snapshots) TO '/tmp/bench/pgfr_snapshots.csv' CSV HEADER; |
| 82 | +COPY (SELECT * FROM pgfr_record.table_snapshots) TO '/tmp/bench/pgfr_table_snapshots.csv' CSV HEADER; |
| 83 | +COPY (SELECT * FROM pgfr_record.statement_snapshots) TO '/tmp/bench/pgfr_statement_snapshots.csv' CSV HEADER; |
| 84 | +``` |
| 85 | + |
| 86 | +Full pgfr_record schema on every VM includes: `snapshots`/`snapshots_v2` (partitioned), `table_snapshots(_v2)`, `statement_snapshots(_v2)`, `index_snapshots(_v2)`, `replication_snapshots(_v2)`, `vacuum_progress_snapshots(_v2)`, `activity_samples(_archive_v2)`, `lock_samples(_archive_v2)`, `config_snapshots`, `db_role_config_snapshots`. |
| 87 | + |
| 88 | +### (d) sys_metrics_sampler.py — CPU / mem / disk every 10 s (R7+) |
| 89 | + |
| 90 | +Reads `/proc/stat`, `/proc/meminfo`, and `/proc/diskstats` directly; psutil is optional. The NVMe device is `nvme1n1` (the instance store that `/mnt/pgdata` sits on). v2 adds per-device IOPS and latency columns.
| 91 | + |
| 92 | +Source: [tooling/sys_metrics_sampler.py](tooling/sys_metrics_sampler.py). |
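As an illustration of the `/proc/stat` side of this sampler, here is a minimal sketch of how CPU busy percentage can be derived from two consecutive samples of the aggregate `cpu` line. It is not the code in `sys_metrics_sampler.py`; the function names `parse_cpu_line` and `cpu_busy_pct` are hypothetical, and it assumes a kernel that reports at least the eight standard fields (user through steal).

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (busy_jiffies, total_jiffies) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal.
    # guest/guest_nice are already counted inside user/nice, so they are skipped.
    values = [int(v) for v in fields[1:9]]
    idle_all = values[3] + values[4]  # idle + iowait
    total = sum(values)
    return total - idle_all, total


def cpu_busy_pct(prev_line: str, curr_line: str) -> float:
    """CPU busy percentage over the interval between two samples."""
    busy0, total0 = parse_cpu_line(prev_line)
    busy1, total1 = parse_cpu_line(curr_line)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0  # no time elapsed between samples, nothing to report
    return 100.0 * (busy1 - busy0) / elapsed
```

The counters are cumulative since boot, so only the difference between two samples is meaningful; a 10 s sampler would call `cpu_busy_pct` with the previous and current first line of `/proc/stat`.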
| 93 | + |
| 94 | +### (e) pg_stat_statements_snapshot.py — pgss time-series every 10 s |
| 95 | + |
| 96 | +Polls pgss with a regex filter over our queue-related query shapes; consecutive snapshots are diffed downstream. Used as a cross-check for the NOTICE-based ev/s (important for pgque/pgq/pgmq, where DO-block wrappers hide per-statement rows).
| 97 | + |
| 98 | +Source: [tooling/pg_stat_statements_snapshot.py](tooling/pg_stat_statements_snapshot.py). |
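The downstream diff step can be sketched as follows. This is an assumption about the shape of the data, not the actual analysis code: each snapshot is modelled as a dict from `queryid` to the cumulative `calls` counter, and `diff_snapshots` is a hypothetical name.

```python
def diff_snapshots(prev: dict[int, int], curr: dict[int, int], interval_s: float) -> dict[int, float]:
    """Turn two cumulative pgss snapshots (queryid -> calls) into calls per second."""
    rates = {}
    for queryid, calls in curr.items():
        delta = calls - prev.get(queryid, 0)
        if delta < 0:
            # The counter went backwards: pgss was reset or the entry was
            # evicted and re-created, so count from zero for this interval.
            delta = calls
        rates[queryid] = delta / interval_s
    return rates
```

With a 10 s polling interval, `diff_snapshots(prev, curr, 10.0)` gives a per-query call rate that can be laid next to the NOTICE-based ev/s series.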
| 99 | + |
| 100 | +### (f) pgbench `--aggregate-interval=10 --log` |
| 101 | + |
| 102 | +Both producer and consumer run with per-10 s aggregate logs (min/max/sum/sumsq latency). Files: `producer_agg.<pid>` and `consumer_agg.<pid>.<worker>` under `/tmp/bench/`. |
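A minimal sketch of reading one line of these aggregate logs, assuming the documented pgbench layout in which the first six fields are `interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency`, with latencies in microseconds. The function name `parse_agg_line` and the output keys are hypothetical; any trailing lag/skip/retry fields are ignored.

```python
import math


def parse_agg_line(line: str, interval_s: int = 10) -> dict:
    """Summarise one pgbench --aggregate-interval log line (latencies in ms)."""
    fields = line.split()
    start = int(fields[0])
    n = int(fields[1])
    lat_sum, lat_sumsq, lat_min, lat_max = (float(x) for x in fields[2:6])
    row = {"start": start, "tps": n / interval_s,
           "mean_ms": None, "stddev_ms": None, "min_ms": None, "max_ms": None}
    if n > 0:
        mean_us = lat_sum / n
        # Population variance from sum and sum of squares; clamp tiny
        # negative values caused by floating-point rounding.
        var_us2 = max(lat_sumsq / n - mean_us ** 2, 0.0)
        row.update(mean_ms=mean_us / 1000.0,
                   stddev_ms=math.sqrt(var_us2) / 1000.0,
                   min_ms=lat_min / 1000.0,
                   max_ms=lat_max / 1000.0)
    return row
```

Only the sum and sum of squares are logged, so the mean and standard deviation are recoverable per interval, but percentiles are not.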
| 103 | + |
| 104 | +### (g) Consumer NOTICE instrumentation (R6+) |
| 105 | + |
| 106 | +Each instrumented `consumer.sql` is wrapped in a DO block that emits exactly one `RAISE NOTICE 'ev ts=<epoch_s> n=<events>'` per call. This gives us an authoritative per-call consumed-events stream that is immune to the pgss DO-wrapper opacity problem. |
| 107 | + |
| 108 | +The seven instrumented consumers are in [consumers/](consumers/). The parser for `consumer.log` NOTICE lines is [tooling/parse_events_consumed.py](tooling/parse_events_consumed.py). |
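To show how the NOTICE stream turns into an events-per-second series, here is a minimal sketch. It is not `parse_events_consumed.py`; it assumes only the `ev ts=<epoch_s> n=<events>` payload described above, allows `ts` to be fractional, and ignores whatever prefix the client puts in front of the message. The name `events_per_second` is hypothetical.

```python
import re
from collections import Counter

# Matches the NOTICE payload anywhere in a log line; ts may be fractional.
EV_RE = re.compile(r"ev ts=(\d+(?:\.\d+)?) n=(\d+)")


def events_per_second(lines) -> dict[int, int]:
    """Sum consumed events into one-second buckets keyed by epoch second."""
    buckets = Counter()
    for line in lines:
        match = EV_RE.search(line)
        if match:
            second = int(float(match.group(1)))
            buckets[second] += int(match.group(2))
    return dict(sorted(buckets.items()))
```

Passing an open `consumer.log` file handle as `lines` works, since the function only iterates over it; lines that do not carry the payload are skipped.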
| 109 | + |
| 110 | +### (h) idle_in_tx.py — the death-spiral inducer |
| 111 | + |
| 112 | +Opens a `REPEATABLE READ` transaction, pins xmin, sleeps forever. Kill with SIGTERM to release the horizon. Source: [tooling/idle_in_tx.py](tooling/idle_in_tx.py). |
| 113 | + |
| 114 | +### (i) pgq_ticker_daemon.py — tight ticker loop for pgq |
| 115 | + |
| 116 | +pgq upstream has no built-in ticker daemon (the C one was always external). pgque's ticker runs inline; for a fair comparison we run a tight Python loop on the pgq VM that calls `pgq.ticker()` at 1 Hz and `pgq.maint_operations()` every 5 s. Source: [tooling/pgq_ticker_daemon.py](tooling/pgq_ticker_daemon.py).
| 117 | + |
| 118 | +--- |
| 119 | + |
| 120 | +## 4. Bench runner |
| 121 | + |
| 122 | +Orchestrates the phase schedule. It forks the producer and consumer (both pgbench), bloat_sampler, sys_metrics_sampler, and the pgss snapshotter. At t=1800 s it opens `idle_in_tx`, with `VACUUM VERBOSE` run at phase boundaries; at t=3600 s it kills the holder and runs another `VACUUM VERBOSE`; at t=5400 s it harvests ash/pgfr/pgss.
| 123 | + |
| 124 | +Source: [runners/run_r7.sh](runners/run_r7.sh). |
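The timeline above maps directly onto a lookup that analysis code can use to label any sample by phase. This is a minimal sketch, not part of `run_r7.sh`; the phase labels and the name `phase_at` are hypothetical, while the boundaries (0, 1800, 3600, 5400 s) come from the schedule described here.

```python
# (start_second, label) pairs in ascending order; the run ends at 5400 s.
PHASES = [(0, "clean_baseline"), (1800, "held_xmin"), (3600, "clean_recovery")]
RUN_END_S = 5400


def phase_at(elapsed_s: float):
    """Return the phase label for a time offset from bench start, or None if outside the run."""
    if elapsed_s < 0 or elapsed_s >= RUN_END_S:
        return None
    label = None
    for start, name in PHASES:
        if elapsed_s >= start:
            label = name
    return label
```

For a CSV row with an epoch timestamp, `phase_at(row_ts - bench_start_ts)` gives the phase it belongs to, which keeps per-phase aggregates consistent across all seven streams.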
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## 5. Clean-slate reset |
| 129 | + |
| 130 | +Before every run we kill stragglers, DROP the system's schema/extensions, unschedule lingering `pg_cron` jobs, re-run `install.sh`, and reset all stats. The companion `full_reset.sql` calls `pg_stat_statements_reset()`, `pg_stat_reset()`, `pg_stat_reset_shared()` for bgwriter/checkpointer/wal/io, `TRUNCATE ash.sample`, and `DELETE FROM cron.job_run_details`. |
| 131 | + |
| 132 | +Source: [runners/clean_reinstall.sh](runners/clean_reinstall.sh). See also [OPS_GOTCHAS.md §4, §5, §7](OPS_GOTCHAS.md) for adjacent-schema pitfalls. |
| 133 | + |
| 134 | +--- |
| 135 | + |
| 136 | +## 6. Per-system install scripts |
| 137 | + |
| 138 | +Each VM has its own `/tmp/install.sh` (mirrored into [install/](install/)). Pattern differs by system: |
| 139 | + |
| 140 | +- **pgque** — `git clone` the PR #62 branch, `make USE_PGXS=1 install`, run the SQL installer, `SELECT pgque.create_queue('bench_queue')`, schedule `pgque.ticker()` via `pg_cron` (also called inline in `consumer.sql`). Driven from AMI user-data; see [install/README.md](install/README.md) and [install/bootstrap.sh](install/bootstrap.sh). |
| 141 | +- **pgq** — PGDG `postgresql-18-pgq3` or `git clone --branch v3.5.1`, then `CREATE EXTENSION pgq` and immediately apply `switch_plonly.sql` to replace the C `insert_event_raw` with PL/pgSQL (a `pg_proc` language check, `= 'plpgsql'`, acts as the gate). See [install/install_pgq.sh](install/install_pgq.sh).
| 142 | +- **pgmq** — PGDG `postgresql-18-pgmq` + `CREATE EXTENSION pgmq` + `pgmq.create('bench_queue')`. See [install/install_pgmq.sh](install/install_pgmq.sh). |
| 143 | +- **pgmq-partitioned** — pgmq + `pg_partman` + `pgmq.create_partitioned('bench_queue', 'id', '10000')`. Reuses [install/install_pgmq.sh](install/install_pgmq.sh) plus [install/pgmq-partitioned_setup_5min.sql](install/pgmq-partitioned_setup_5min.sql) for the partman schedule. |
| 144 | +- **river** — `go install github.com/riverqueue/river/cmd/river@v0.34.x` and `river migrate-up` (creates the schema), then the consumer SQL emulates the SKIP-LOCKED claim pattern. See [install/install_river.sh](install/install_river.sh).
| 145 | +- **que** — Ruby + `gem install que -v 2.4.x`, `bundle exec que:install` creates `que_jobs`; consumer SQL emulates the Ruby Que SELECT/DELETE. Driven from AMI user-data — see [install/README.md](install/README.md) and [OPS_GOTCHAS.md §7](OPS_GOTCHAS.md). |
| 146 | +- **pg-boss** — local `npm install pg-boss@12.15`, run `new PgBoss(DSN).start()` once to migrate schema; consumer SQL emulates the `SKIP LOCKED` claim on `pgboss.job`. See [install/install_pgboss.sh](install/install_pgboss.sh). |
| 147 | + |
| 148 | +--- |
| 149 | + |
| 150 | +## 7. Per-system consumer.sql patterns |
| 151 | + |
| 152 | +All seven instrumented consumers are in [consumers/](consumers/): |
| 153 | + |
| 154 | +- [consumer_pgque.sql](consumers/consumer_pgque.sql) — ticker + `next_batch` + `get_batch_events` + `finish_batch` |
| 155 | +- [consumer_pgq.sql](consumers/consumer_pgq.sql) — identical, schema swap pgque → pgq |
| 156 | +- [consumer_pgmq.sql](consumers/consumer_pgmq.sql) — `pgmq.read(50)` + `pgmq.delete` |
| 157 | +- [consumer_pgmq-partitioned.sql](consumers/consumer_pgmq-partitioned.sql) — identical SQL to pgmq (schema hides partitioning) |
| 158 | +- [consumer_river.sql](consumers/consumer_river.sql) — SKIP LOCKED + DELETE on `river_job` |
| 159 | +- [consumer_que.sql](consumers/consumer_que.sql) — SKIP LOCKED + DELETE on `que_jobs` |
| 160 | +- [consumer_pgboss.sql](consumers/consumer_pgboss.sql) — SKIP LOCKED + DELETE on `pgboss.job` |
| 161 | + |
| 162 | +All producers are in [producers/](producers/). |
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +## 8. Analysis and chart generation |
| 167 | + |
| 168 | +- [charts/r5_analyze.py](charts/r5_analyze.py) — R5 full 2-panel chart (dead tuples + consumer latency, linear scale, no symlog — per [MEMORY rule](../CLAUDE.md) "Never use log/symlog on charts"). |
| 169 | +- [charts/r6_smoke_chart.py](charts/r6_smoke_chart.py) — R6 smoke Solarized-Dark 2-panel chart (events/s + pgque per-table dead tuples). |
| 170 | +- [gifs/r4_gif_v17_solarized.py](gifs/r4_gif_v17_solarized.py) — R4 dead-tuples animated GIF (7 systems, Solarized-Dark). |
| 171 | +- [gifs/r4_gif_tps_solarized.py](gifs/r4_gif_tps_solarized.py) — R4 TPS/latency animated GIF. |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## 9. GitLab posting style |
| 176 | + |
| 177 | +- Threads open under `### header`; replies stay inside the same discussion. |
| 178 | +- Every `<details>` block has **one blank line** between the opening tag and the triple-backtick fence, and **one blank line** between the closing fence and `</details>` — and another blank line after `</details>`. This is what makes them render correctly on GitLab. |
| 179 | +- Every significant result embeds its chart as an inline image; charts are first-class citizens, GIFs too.
| 180 | +- Post bodies via `curl --data-urlencode "body@/tmp/path"` to the discussions endpoint — new thread, not a reply under an existing one. |
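The `<details>` spacing rule above is easy to get wrong by hand, so a small helper can generate the block. This is a minimal sketch under stated assumptions: `details_block` is a hypothetical name, and placing a `<summary>` line directly under the opening tag is an assumption, since the rule above only fixes where the blank lines go.

```python
def details_block(summary: str, body: str, lang: str = "") -> str:
    """Wrap body in a collapsible block with the blank lines the style rule requires."""
    fence = "`" * 3  # built at runtime so this example can itself sit inside a fence
    return "\n".join([
        "<details>",
        f"<summary>{summary}</summary>",
        "",                    # blank line before the opening fence
        f"{fence}{lang}",
        body.rstrip("\n"),
        fence,
        "",                    # blank line between the closing fence and </details>
        "</details>",
        "",                    # these two empty items leave one blank line after </details>
        "",
    ])
```

Writing the result to a file and posting it with the `curl --data-urlencode` call above keeps the spacing intact, since the file is sent as-is.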
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## 10. Cost discipline |
| 185 | + |
| 186 | +- us-east-2 spot preferred (~$0.22/h); when spot is exhausted we hop to another region before going on-demand. |
| 187 | +- At R7 time 2 VMs are on-demand (pgmq-partitioned and pgboss; pgque+que moved on-demand after R6 spot reclaim). |
| 188 | +- **~$15 per 1.5 h bench round** (7 VMs). |
| 189 | +- Total spend through R7 is ~$100. |
| 190 | +- Rsync every 30 min pulls `/tmp/bench/` to local, so a spot reclaim loses at most a partial phase. |
| 191 | + |
| 192 | +See [OPS_GOTCHAS.md §12](OPS_GOTCHAS.md) for spot-reclaim mitigations. |
| 193 | + |
| 194 | +--- |
| 195 | + |
| 196 | +## Reproducing from scratch |
| 197 | + |
| 198 | +1. Launch 7 × i4i.2xlarge in us-east-2 (spot first) with `bench_userdata.sh` (symlinks `/mnt/pgdata`, installs PGDG PG18, pg_cron, pg_ash, pgfr, pg_stat_statements). See [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) for the NVMe side if the userdata path breaks. |
| 199 | +2. `scp` per-system `install.sh`, `producer.sql`, instrumented `consumer.sql`, plus the samplers in [tooling/](tooling/). |
| 200 | +3. `bash clean_reinstall.sh <sys>` on each VM. |
| 201 | +4. `bash run_r7.sh <sys>` — 1.5 h, writes `/tmp/bench/*.csv` + `*.log`. |
| 202 | +5. Rsync everything back to `/tmp/bench_r7/<sys>/`. |
| 203 | +6. Run [charts/r5_analyze.py](charts/r5_analyze.py) (or its R7 successor) for the verdict table + 2-panel PNG.
| 204 | +7. Post results as a threaded reply with the `<details>`/blank-line style above. |
| 205 | + |
| 206 | +That is the whole rig. |