Commit ee105e8

Nik Samokhvalov, claude
authored and committed
docs: complete bench methodology + tooling + ops gotchas under benchmark/
Adds a strictly-additive benchmark/ directory documenting the methodology, tooling, and operational lessons from the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: methodology fix per review feedback
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8 bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched. Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d96817b commit ee105e8

39 files changed

Lines changed: 3366 additions & 0 deletions

benchmark/HARDWARE.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Hardware — VM sizing and microbench baselines

## VM: AWS i4i.2xlarge (us-east-2)

| Spec | Value |
|---|---|
| vCPU | 8 (Intel Ice Lake Xeon 8375C, 2.9 GHz base / 3.5 GHz turbo) |
| RAM | 64 GB |
| NVMe instance store | 1 × 1.75 TB (physical attach, NVMe) |
| Network | Up to 12 Gbps |
| EBS (root) | 8 GB gp3 (Ubuntu 24.04 AMI default) |
| Spot price (us-east-2, 2026-04) | ~$0.20–0.30 / hour |
| On-demand price | $0.686 / hour |

See [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md) — the NVMe instance store is **not** auto-mounted on Ubuntu 24.04 boot.

## Expected microbench baselines

Run via [tooling/microbench.sh](tooling/microbench.sh). Expected order-of-magnitude numbers:

| Probe | Expected |
|---|---|
| sysbench cpu (1-thread events/sec) | ~25 k |
| sysbench memory bandwidth (1-thread) | ~15 GiB/sec |
| fio 4 k randwrite, QD=32, direct=1, on NVMe | ~300 k IOPS |
| fio 4 k randwrite, bandwidth | ~1.2 GiB/sec |
| fio 4 k randwrite, p99 latency | ~100 µs |
| fio 4 k randread, QD=32, on NVMe | ~400 k IOPS |

*Actual R7 microbench numbers to be filled in from the R7 microbench pass.*

## Postgres tuning (shared across all 7 VMs)

Applied by each VM's bootstrap (see `install/install_*.sh`):

```
shared_preload_libraries = 'pg_stat_statements,pg_cron'
cron.database_name = 'bench'

shared_buffers = 4GB
effective_cache_size = 12GB

synchronous_commit = off
wal_level = minimal
wal_compression = lz4
max_wal_size = 16GB
checkpoint_completion_target = 0.9

bgwriter_delay = 50ms
bgwriter_lru_maxpages = 400
bgwriter_lru_multiplier = 4.0

random_page_cost = 1.1
effective_io_concurrency = 200
max_connections = 200

max_wal_senders = 0

autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.01
autovacuum_vacuum_cost_delay = 2ms

jit = off
listen_addresses = 'localhost'
```

`synchronous_commit=off` is deliberate — queue workloads are almost always idempotent at the application layer, and the WAL-flush path is the dominant cost for low-latency producers. It's the only PG knob we touch that materially changes safety posture.

`jit=off` because 5-s JIT warmups on `DO` blocks dominated our first-transaction latency in R4.

`autovacuum_*_scale_factor=0.01` is aggressive on purpose — we want autovacuum to trigger once dead tuples reach roughly 1 % of a table, so the held-xmin phase of the bench exposes the *inability* to vacuum, not a lazy autovacuum schedule.
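
To make the 1 % figure concrete: PostgreSQL starts an autovacuum on a table once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. The sketch below assumes the stock `autovacuum_vacuum_threshold` of 50, since the config above does not override it. The function name and the row counts are hypothetical, chosen only for illustration.

```python
def autovacuum_trigger(reltuples: int,
                       scale_factor: float = 0.01,
                       base_threshold: int = 50) -> int:
    """Dead-tuple count above which autovacuum vacuums the table.

    Mirrors the documented formula:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
    base_threshold=50 is the PostgreSQL default (assumed, not set in this bench).
    """
    return base_threshold + round(scale_factor * reltuples)


if __name__ == "__main__":
    # Hypothetical 1M-row queue table: bench setting vs. the stock 0.2 factor.
    print(autovacuum_trigger(1_000_000))                    # bench config
    print(autovacuum_trigger(1_000_000, scale_factor=0.2))  # PostgreSQL default
```

With the bench setting, a table of a million rows becomes a vacuum candidate about twenty times sooner than with the default factor, which is why a growing dead-tuple count during the held-xmin phase can be attributed to the pinned horizon.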

benchmark/METHODOLOGY.md

Lines changed: 206 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,206 @@
# Bench methodology — the definitive reference

Adapted from GitLab issue [postgres-ai/postgresql-consulting/tests-and-benchmarks#77, note 3263767264](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/77#note_3263767264).

This document is the single source of truth for how the PostgreSQL-queue bench is structured, what it measures, and every script/config that makes it work. It is written so a reviewer can reproduce the whole thing end-to-end.

Cross-links:

- Upstream fix: [pgque PR #62](https://github.com/NikolayS/pgque/pull/62) · issue [NikolayS/pgque#61](https://github.com/NikolayS/pgque/issues/61)
- GitLab rounds: R4 final narrative — note 3262968456; R5 root — note 3263000815; R5 full writeup — note 3263438287; pgque rotation PoC under held xmin — note 3263417609.

---

## 1. Goals and pathology

We reproduce Brandur Leach's "Postgres queues" death spiral (see [Brandur's 2015 post](https://brandur.org/postgres-queues)): a DELETE-based work-queue becomes unvacuumably bloated the moment an unrelated backend holds `xmin` in the past (long transactions, logical-replication slots, stuck standby feedback). Dead tuples accumulate; bitmap / index scans must traverse them; latency explodes; throughput collapses even after the holder releases.

Seven systems, each on its own VM:

| System | Version | Pattern |
|---|---|---|
| **pgque** | patched via PR #62 | batch ticker + rotating event tables (TRUNCATE) |
| pgq | v3.5.1 PL-only | same model as pgque; upstream baseline |
| pgmq | v1.11.0 | single queue table, VISIBILITY + DELETE |
| pgmq-partitioned | v1.11.0 + pg_partman | range-partitioned queue table |
| river | v0.34 | SQL-level SKIP LOCKED + DELETE (consumer emulated) |
| que | v2.4 | SKIP LOCKED + DELETE (consumer emulated) |
| pg-boss | v12.15 | SKIP LOCKED + DELETE on partitioned `pgboss.job` |

The Go/Ruby/Node workers are installed end-to-end so the schema is authentic, but the actual consumer load is driven via pgbench running each system's *SQL claim pattern*. This isolates the DB-side behaviour from runtime/GC artefacts.

Workload shape (all runs):

- Producer: 1 client, pgbench `-R 1000` rate-cap (may move to `-R 5000` in R7/R8)
- Consumer: 1 client for pgque/pgq, 4 clients for everything else
- Three 30-minute phases back-to-back = **1.5 h per run**:
  1. **Clean baseline** — no held xmin
  2. **Held xmin** — `idle_in_tx.py` holds a `REPEATABLE READ` transaction open
  3. **Clean recovery** — holder killed, observe regrowth / catchup

---

## 2. Infrastructure

- AWS us-east-2, **i4i.2xlarge** (8 vCPU, 64 GB, NVMe instance store)
- Spot where available; on-demand only as last resort (see Section 10)
- Ubuntu 24.04, **PG18 from PGDG**, `pg_cron`, `pg_stat_statements`, pg_ash, pgfr
- Data dir moved to NVMe: `/mnt/pgdata/postgresql/18/main` symlink (see [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) and [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md))
- One VM per system so any tuning / runtime / GC behaviour is contained
- SSH key: `<your-ssh-key>` (us-east-2)

Live R7 VMs (redacted): `<pgque-ip>`, `<pgq-ip>`, `<pgmq-ip>`, `<river-ip>`, `<pgboss-ip>`, `<que-ip>`, `<pgmq-partitioned-ip>`.

---

## 3. Observability stack

Seven parallel streams run during every bench. All CSVs land in `/tmp/bench/` and are rsynced to the local `/tmp/bench_r<N>/<system>/` tree every 30 min.

### (a) bloat_sampler.py — pg_stat_user_tables every 30 s

Per-system filter covers each queue's event/metadata tables (including pg-boss partitions and pgmq-partitioned's partman partitions). Writes `bloat.csv`.

Source: [tooling/bloat_sampler.py](tooling/bloat_sampler.py).

### (b) pg_ash — 1 Hz wait-event sampling

pg_cron runs ash sampling at 1 Hz throughout; at end of bench we export:

```sql
COPY (SELECT sample_time, database_name, active_backends, wait_event, query_id
      FROM ash.samples(p_interval => '2 hour'::interval, p_limit => 2000000))
TO '/tmp/bench/ash.csv' CSV HEADER;
```

### (c) pgfr (pg-flight-recorder) — snapshot observability

pgfr writes snapshots on a pg_cron schedule into `pgfr_record.*`. At end of run we export:

```sql
COPY (SELECT * FROM pgfr_record.snapshots) TO '/tmp/bench/pgfr_snapshots.csv' CSV HEADER;
COPY (SELECT * FROM pgfr_record.table_snapshots) TO '/tmp/bench/pgfr_table_snapshots.csv' CSV HEADER;
COPY (SELECT * FROM pgfr_record.statement_snapshots) TO '/tmp/bench/pgfr_statement_snapshots.csv' CSV HEADER;
```

Full pgfr_record schema on every VM includes: `snapshots`/`snapshots_v2` (partitioned), `table_snapshots(_v2)`, `statement_snapshots(_v2)`, `index_snapshots(_v2)`, `replication_snapshots(_v2)`, `vacuum_progress_snapshots(_v2)`, `activity_samples(_archive_v2)`, `lock_samples(_archive_v2)`, `config_snapshots`, `db_role_config_snapshots`.

### (d) sys_metrics_sampler.py — CPU / mem / disk every 10 s (R7+)

Reads `/proc/stat`, `/proc/meminfo`, `/proc/diskstats` directly (psutil-optional). NVMe device is `nvme1n1` (the instance store, which is what `/mnt/pgdata` sits on). v2 adds per-device IOPS and latency columns.

Source: [tooling/sys_metrics_sampler.py](tooling/sys_metrics_sampler.py).

### (e) pg_stat_statements_snapshot.py — pgss time-series every 10 s

Polls pgss with a regex filter over our queue-related query shapes, diffs consecutive snapshots downstream. Used as a cross-check for NOTICE-based ev/s (important for pgque/pgq/pgmq where DO-block wrappers hide per-statement rows).

Source: [tooling/pg_stat_statements_snapshot.py](tooling/pg_stat_statements_snapshot.py).
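
The downstream diff step can be illustrated with a small sketch. It assumes each snapshot has already been reduced to a mapping from query id to the cumulative `calls` counter; that shape, and the name `diff_snapshots`, are hypothetical and not taken from the repo's tooling.

```python
def diff_snapshots(prev: dict[int, int], curr: dict[int, int]) -> dict[int, int]:
    """Per-query call deltas between two cumulative pgss snapshots.

    prev/curr map queryid -> cumulative calls (assumed shape).
    If a counter went backwards (stats reset, or the entry was evicted
    and re-created), the current value is taken as the delta.
    """
    deltas = {}
    for queryid, calls in curr.items():
        before = prev.get(queryid, 0)
        deltas[queryid] = calls - before if calls >= before else calls
    return deltas


if __name__ == "__main__":
    t0 = {101: 100, 102: 50}
    t10 = {101: 160, 102: 10, 103: 7}   # 102 was reset, 103 is new
    per_interval = diff_snapshots(t0, t10)
    # Divide by the 10 s polling interval to get calls per second.
    print({q: d / 10 for q, d in per_interval.items()})
```

Treating a backwards counter as a fresh start keeps a mid-run reset from producing negative rates in the time series.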

### (f) pgbench `--aggregate-interval=10 --log`

Both producer and consumer run with per-10 s aggregate logs (min/max/sum/sumsq latency). Files: `producer_agg.<pid>` and `consumer_agg.<pid>.<worker>` under `/tmp/bench/`.
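
Because the aggregate log carries only the latency sum and sum of squares per interval, mean and standard deviation have to be derived. A minimal sketch, assuming the column order documented for pgbench aggregated logging (interval start, transaction count, latency sum, latency sum of squares, min, max, with latencies in microseconds) and ignoring any trailing columns. `summarize_agg_line` is a hypothetical helper, not part of the repo's tooling.

```python
import math


def summarize_agg_line(line: str) -> dict:
    """Derive mean/stddev latency (ms) from one pgbench aggregate-log line."""
    fields = line.split()
    interval_start = int(fields[0])
    n = int(fields[1])
    sum_lat_us = float(fields[2])
    sum_lat2_us = float(fields[3])
    min_us, max_us = float(fields[4]), float(fields[5])

    if n == 0:
        mean_us = stddev_us = 0.0
    else:
        mean_us = sum_lat_us / n
        # Var(X) = E[X^2] - E[X]^2; clamp tiny negatives from rounding.
        variance = max(sum_lat2_us / n - mean_us ** 2, 0.0)
        stddev_us = math.sqrt(variance)

    return {
        "interval_start": interval_start,
        "transactions": n,
        "mean_ms": mean_us / 1000.0,
        "stddev_ms": stddev_us / 1000.0,
        "min_ms": min_us / 1000.0,
        "max_ms": max_us / 1000.0,
    }


if __name__ == "__main__":
    # Synthetic line: four transactions of 800, 1200, 1000 and 1000 microseconds.
    print(summarize_agg_line("1713800000 4 4000 4080000 800 1200"))
```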

### (g) Consumer NOTICE instrumentation (R6+)

Each instrumented `consumer.sql` is wrapped in a DO block that emits exactly one `RAISE NOTICE 'ev ts=<epoch_s> n=<events>'` per call. This gives us an authoritative per-call consumed-events stream that is immune to the pgss DO-wrapper opacity problem.

The seven instrumented consumers are in [consumers/](consumers/). The parser for `consumer.log` NOTICE lines is [tooling/parse_events_consumed.py](tooling/parse_events_consumed.py).
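
For orientation, turning those NOTICE lines into an events-per-second series could look roughly like the sketch below. It is not the repo's `parse_events_consumed.py`; it only assumes the `ev ts=<epoch_s> n=<events>` payload described above, that `ts` may carry a fractional part, and that any prefix on the line (such as `NOTICE:`) can be ignored.

```python
import re
from collections import defaultdict

# Matches the payload emitted by the instrumented consumers.
EV_RE = re.compile(r"ev ts=(\d+(?:\.\d+)?) n=(\d+)")


def events_per_second(lines) -> dict[int, int]:
    """Sum consumed events into 1-second buckets keyed by epoch second."""
    buckets: dict[int, int] = defaultdict(int)
    for line in lines:
        match = EV_RE.search(line)
        if match is None:
            continue  # not an instrumentation line
        second = int(float(match.group(1)))
        buckets[second] += int(match.group(2))
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        "NOTICE:  ev ts=1713800000.25 n=50",
        "NOTICE:  ev ts=1713800000.90 n=30",
        "pgbench: some unrelated line",
        "NOTICE:  ev ts=1713800001 n=0",
    ]
    print(events_per_second(sample))
```

Calls that consumed nothing still create a bucket with a count of zero, which makes idle seconds distinguishable from seconds with no consumer call at all.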

### (h) idle_in_tx.py — the death-spiral inducer

Opens a `REPEATABLE READ` transaction, pins xmin, sleeps forever. Kill with SIGTERM to release the horizon. Source: [tooling/idle_in_tx.py](tooling/idle_in_tx.py).

### (i) pgq_ticker_daemon.py — tight ticker loop for pgq

pgq upstream has no built-in ticker daemon (the C one was always external). pgque runs inline; for a fair comparison we run a tight Python loop on the pgq VM that calls `pgq.ticker()` at 1 Hz and `pgq.maint_operations()` every 5 s. Source: [tooling/pgq_ticker_daemon.py](tooling/pgq_ticker_daemon.py).
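
The cadence of such a loop (tick every second, maintenance on every fifth tick) can be sketched without a database connection. In the snippet below `tick` and `maint` are hypothetical callables standing in for executing `SELECT pgq.ticker()` and `SELECT pgq.maint_operations()`; the real daemon's structure is not shown in this document, and clock drift is deliberately ignored.

```python
import time
from typing import Callable


def run_ticker(tick: Callable[[], None],
               maint: Callable[[], None],
               iterations: int,
               tick_interval_s: float = 1.0,
               maint_every: int = 5,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Call tick() once per interval and maint() on every maint_every-th tick."""
    for i in range(iterations):
        tick()
        if (i + 1) % maint_every == 0:
            maint()
        sleep(tick_interval_s)


if __name__ == "__main__":
    counts = {"tick": 0, "maint": 0}

    def fake_tick() -> None:
        counts["tick"] += 1

    def fake_maint() -> None:
        counts["maint"] += 1

    # No real sleeping here, so the demo finishes immediately.
    run_ticker(fake_tick, fake_maint, iterations=10, sleep=lambda s: None)
    print(counts)
```

Injecting `sleep` keeps the schedule testable; a production loop would more likely run until signalled and subtract the time spent in each call from the sleep.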

---

## 4. Bench runner

Orchestrates the phase scheduler. Forks producer, consumer (both pgbench), bloat_sampler, sys_metrics_sampler, pgss snapshotter; at t=1800 s opens `idle_in_tx`; runs `VACUUM VERBOSE` at phase boundaries; at t=3600 s kills the holder; another `VACUUM VERBOSE`; at t=5400 s harvests ash/pgfr/pgss.

Source: [runners/run_r7.sh](runners/run_r7.sh).
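
The timeline above is easiest to check as a lookup from elapsed seconds to phase. The sketch below is a Python illustration of that mapping only; the real runner is the shell script linked above, and the phase labels and function names here are hypothetical.

```python
# Phase start offsets in seconds, taken from the timeline above.
PHASES = [
    (0, "clean_baseline"),     # no held xmin
    (1800, "held_xmin"),       # idle_in_tx opened
    (3600, "clean_recovery"),  # holder killed
]
RUN_END_S = 5400               # harvest ash/pgfr/pgss


def phase_at(elapsed_s: float) -> str:
    """Return the phase label active at the given elapsed time."""
    if elapsed_s >= RUN_END_S:
        return "done"
    current = PHASES[0][1]
    for start, name in PHASES:
        if elapsed_s >= start:
            current = name
    return current


def expected_events(rate_per_s: int = 1000, phase_s: int = 1800) -> int:
    """Events a rate-capped producer emits in one phase (pgbench -R 1000)."""
    return rate_per_s * phase_s


if __name__ == "__main__":
    for t in (0, 1799, 1800, 3600, 5400):
        print(t, phase_at(t))
    print(expected_events())
```

At the documented `-R 1000` cap each 30-minute phase corresponds to 1.8 million produced events, which is a useful sanity check against the consumed-event totals.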

---

## 5. Clean-slate reset

Before every run we kill stragglers, DROP the system's schema/extensions, unschedule lingering `pg_cron` jobs, re-run `install.sh`, and reset all stats. The companion `full_reset.sql` calls `pg_stat_statements_reset()`, `pg_stat_reset()`, `pg_stat_reset_shared()` for bgwriter/checkpointer/wal/io, `TRUNCATE ash.sample`, and `DELETE FROM cron.job_run_details`.

Source: [runners/clean_reinstall.sh](runners/clean_reinstall.sh). See also [OPS_GOTCHAS.md §4, §5, §7](OPS_GOTCHAS.md) for adjacent-schema pitfalls.

---

## 6. Per-system install scripts

Each VM has its own `/tmp/install.sh` (mirrored into [install/](install/)). Pattern differs by system:

- **pgque** — `git clone` the PR #62 branch, `make USE_PGXS=1 install`, run the SQL installer, `SELECT pgque.create_queue('bench_queue')`, schedule `pgque.ticker()` via `pg_cron` (also called inline in `consumer.sql`). Driven from AMI user-data; see [install/README.md](install/README.md) and [install/bootstrap.sh](install/bootstrap.sh).
- **pgq** — PGDG `postgresql-18-pgq3` or `git clone --branch v3.5.1`, then `CREATE EXTENSION pgq` and immediately apply `switch_plonly.sql` to replace the C `insert_event_raw` with PL/pgSQL (pg_proc lang check `= 'plpgsql'` as a gate). See [install/install_pgq.sh](install/install_pgq.sh).
- **pgmq** — PGDG `postgresql-18-pgmq` + `CREATE EXTENSION pgmq` + `pgmq.create('bench_queue')`. See [install/install_pgmq.sh](install/install_pgmq.sh).
- **pgmq-partitioned** — pgmq + `pg_partman` + `pgmq.create_partitioned('bench_queue', 'id', '10000')`. Reuses [install/install_pgmq.sh](install/install_pgmq.sh) plus [install/pgmq-partitioned_setup_5min.sql](install/pgmq-partitioned_setup_5min.sql) for the partman schedule.
- **river** — `go install github.com/riverqueue/river/cmd/river@v0.34.x` and `river migrate-up` (sets schema), then the consumer SQL emulates the SKIP-LOCKED claim pattern. See [install/install_river.sh](install/install_river.sh).
- **que** — Ruby + `gem install que -v 2.4.x`, `bundle exec que:install` creates `que_jobs`; consumer SQL emulates the Ruby Que SELECT/DELETE. Driven from AMI user-data — see [install/README.md](install/README.md) and [OPS_GOTCHAS.md §7](OPS_GOTCHAS.md).
- **pg-boss** — local `npm install pg-boss@12.15`, run `new PgBoss(DSN).start()` once to migrate schema; consumer SQL emulates the `SKIP LOCKED` claim on `pgboss.job`. See [install/install_pgboss.sh](install/install_pgboss.sh).

---

## 7. Per-system consumer.sql patterns

All seven instrumented consumers are in [consumers/](consumers/):

- [consumer_pgque.sql](consumers/consumer_pgque.sql) — ticker + `next_batch` + `get_batch_events` + `finish_batch`
- [consumer_pgq.sql](consumers/consumer_pgq.sql) — identical, schema swap pgque → pgq
- [consumer_pgmq.sql](consumers/consumer_pgmq.sql) — `pgmq.read(50)` + `pgmq.delete`
- [consumer_pgmq-partitioned.sql](consumers/consumer_pgmq-partitioned.sql) — identical SQL to pgmq (schema hides partitioning)
- [consumer_river.sql](consumers/consumer_river.sql) — SKIP LOCKED + DELETE on `river_job`
- [consumer_que.sql](consumers/consumer_que.sql) — SKIP LOCKED + DELETE on `que_jobs`
- [consumer_pgboss.sql](consumers/consumer_pgboss.sql) — SKIP LOCKED + DELETE on `pgboss.job`

All producers are in [producers/](producers/).

---

## 8. Analysis and chart generation

- [charts/r5_analyze.py](charts/r5_analyze.py) — R5 full 2-panel chart (dead tuples + consumer latency, linear scale, no symlog — per [MEMORY rule](../CLAUDE.md) "Never use log/symlog on charts").
- [charts/r6_smoke_chart.py](charts/r6_smoke_chart.py) — R6 smoke Solarized-Dark 2-panel chart (events/s + pgque per-table dead tuples).
- [gifs/r4_gif_v17_solarized.py](gifs/r4_gif_v17_solarized.py) — R4 dead-tuples animated GIF (7 systems, Solarized-Dark).
- [gifs/r4_gif_tps_solarized.py](gifs/r4_gif_tps_solarized.py) — R4 TPS/latency animated GIF.

---

## 9. GitLab posting style

- Threads open under `### header`; replies stay inside the same discussion.
- Every `<details>` block has **one blank line** between the opening tag and the triple-backtick fence, and **one blank line** between the closing fence and `</details>` — and another blank line after `</details>`. This is what makes them render correctly on GitLab.
- Every significant result embeds `![alt](/uploads/…)` — charts are first-class citizens. GIFs too.
- Post bodies via `curl --data-urlencode "body@/tmp/path"` to the discussions endpoint — new thread, not a reply under an existing one.

---

## 10. Cost discipline

- us-east-2 spot preferred (~$0.22/h); when spot is exhausted we hop to another region before going on-demand.
- At R7 time 2 VMs are on-demand (pgmq-partitioned and pgboss; pgque+que moved on-demand after R6 spot reclaim).
- **~$15 per 1.5 h bench round** (7 VMs).
- Total spend through R7 is ~$100.
- Rsync every 30 min pulls `/tmp/bench/` to local, so a spot reclaim loses at most a partial phase.

See [OPS_GOTCHAS.md §12](OPS_GOTCHAS.md) for spot-reclaim mitigations.

---

## Reproducing from scratch

1. Launch 7 × i4i.2xlarge in us-east-2 (spot first) with `bench_userdata.sh` (symlinks `/mnt/pgdata`, installs PGDG PG18, pg_cron, pg_ash, pgfr, pg_stat_statements). See [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) for the NVMe side if the userdata path breaks.
2. `scp` per-system `install.sh`, `producer.sql`, instrumented `consumer.sql`, plus the samplers in [tooling/](tooling/).
3. `bash clean_reinstall.sh <sys>` on each VM.
4. `bash run_r7.sh <sys>` — 1.5 h, writes `/tmp/bench/*.csv` + `*.log`.
5. Rsync everything back to `/tmp/bench_r7/<sys>/`.
6. Run [charts/r5_analyze.py](charts/r5_analyze.py) (or R7 successor) for the verdict table + 2-panel PNG.
7. Post results as a threaded reply with the `<details>`/blank-line style above.

That is the whole rig.
