Architecture overview

linkdb extends PostgreSQL 16.13 with a cluster subsystem that targets shared-disk multi-node deployment patterns (akin to Oracle Real Application Clusters). This document gives a 30,000-foot view of the components present today and the boundary between cluster code and stock PostgreSQL.

High-level component map

+-----------------------------------------------------------+
|  PostgreSQL 16.13 backend (parser / planner / executor)   |
|                          ...                              |
+-----------------------------------------------------------+
|  PostgreSQL infrastructure (shmem / WAL / lock / buffer)  |
+-----------+----------+----------+--------------+----------+
            |          |          |              |
            v          v          v              v
+-----------+ +--------+ +--------+ +------------+
| cluster   | | cluster| | cluster| | cluster    |
| _guc      | | _shmem | | _ic    | | _conf      |
| (GUCs)    | |(shared | |(IPC    | |(topology   |
|           | | memory)| | API)   | | from       |
|           | |        | |        | | pgrac.conf)|
+-----------+ +--------+ +--------+ +------------+
            |          |          |              |
            +----------+----------+--------------+
                          |
                          v
            +-----------------------------+
            | cluster_views (SQL surface) |
            |  pg_cluster_nodes           |
            |  pg_stat_cluster_wait_events|
            |  pg_stat_gcluster_wait_events|
            +-----------------------------+

Modules at a glance

Module	Source	Purpose
`cluster`	`src/backend/cluster/cluster.c`	Top-level entry point (init / shutdown stubs)
`cluster_guc`	`src/backend/cluster/cluster_guc.c`	Registers `cluster.*` GUCs (node_id / interconnect_tier / config_file)
`cluster_shmem`	`src/backend/cluster/cluster_shmem.c`	Cluster shared-memory control block + per-subsystem allocators
`cluster_ic`	`src/backend/cluster/cluster_ic.c`	Inter-node IPC abstraction (byte-stream + protocol layers; vtable per tier)
`cluster_conf`	`src/backend/cluster/cluster_conf.c`	Parses `pgrac.conf` into the topology shmem region
`cluster_signal`	`src/backend/cluster/cluster_signal.c`	Cluster ProcSignal handlers (extends PG SIGUSR1 multiplexer)
`cluster_views`	`src/backend/cluster/cluster_views.c`	Set-returning functions backing the three cluster system views
`cluster_elog`	`src/backend/cluster/cluster_elog.c`	`CLUSTER_LOG()` macro and the `cluster_phase` lifecycle marker
`cluster_smgr`	`src/backend/cluster/storage/cluster_smgr.c`	Cluster-aware storage manager. Becomes a second `smgrsw[]` entry when `cluster.smgr_user_relations = on`;permanent user relations route through this path so two cluster instances can each open the same relation concurrently.
`cluster_shared_fs`	`src/backend/cluster/storage/cluster_shared_fs.c`	Per-cluster shared filesystem backend pluggable behind `cluster_smgr` (currently `local` only).
`cluster_qvotec`	`src/backend/cluster/cluster_qvotec.c`	Quorum voting coordinator background process. Polls voting disks, computes the cluster majority view, and signals backends to fail closed when quorum is lost. Catalog surface (GUCs / views / wait events) currently active; the runtime spawn integration is held in a follow-up release.
`cluster_voting_disk_io`	`src/backend/cluster/cluster_voting_disk_io.c`	Voting-disk slot read / write primitives: `O_DIRECT` best-effort with `O_SYNC` fallback, generation-and-CRC32C torn-write detection. Used by the coordinator.
`cluster_quorum_decision`	`src/backend/cluster/cluster_quorum_decision.c`	Pure-function majority math + cross-instance node-id collision detection over the slot matrix observed by `cluster_qvotec`.

Quorum coordination

The cluster operates in fail-closed mode: writes are permitted only while a healthy majority of voting disks confirm membership. The voting disks are pre-allocated files on shared storage configured via cluster.voting_disks; each instance owns a 512-byte slot indexed by its node_id.

Every cluster.quorum_poll_interval_ms, the coordinator (a) writes its own slot with current heartbeat / epoch / incarnation, then (b) reads every disk and combines the slot matrix into a ClusterQuorumView (alive bitmap, max observed epoch, collision report). Backends use a lease window of 2 × the poll interval to gate write transactions: on COMMIT, a backend that finds the lease expired or the view marked not-OK aborts with SQLSTATE 53R40 (ERRCODE_CLUSTER_QUORUM_LOST).

If a stale instance restarts with the same node_id as a peer that is already serving, the newer instance fails fatal at first poll (SQLSTATE 53R43) so the older serving instance keeps the slot.

The pg_cluster_quorum_state and pg_cluster_voting_disks views expose the current state. See system-views.md for column reference.

Status. The catalog surface ships in this release; the coordinator's postmaster spawn integration is held in a follow-up release. Until then, pg_cluster_quorum_state.in_quorum reports the fail-closed default.

In-flight transaction management on quorum loss

When the cluster transitions from healthy to lost-quorum, two independent defenses fire to protect data integrity:

Commit-boundary fail-closed gate (always-on). Every CommitTransaction call consults the lease-aware cluster_qvotec_in_quorum() predicate;a false return raises ERRCODE_CLUSTER_QUORUM_LOST (SQLSTATE 53R40). No transaction that survives the lease window can commit through a quorum loss.
In-flight PROCSIG_CLUSTER_FREEZE_WRITES broadcast. The cluster coordinator (LMON) immediately signals every live backend; the next CHECK_FOR_INTERRUPTS after the signal raises ERRCODE_CLUSTER_QUORUM_LOST_BACKEND (SQLSTATE 53R50) and rolls back the transaction, even if it has not yet reached its commit boundary. Long-running queries are aborted within milliseconds instead of seconds, removing wasted work and locks.

If the quorum loss persists for cluster.self_fence_grace_ms (default 30 s), the postmaster initiates a fast shutdown via SIGINT. Backends receive SIGTERM, dirty buffers flush, and recovery resumes from the voting-disk slot on the next start. This is a defense-in-depth backstop — the commit gate already prevents incorrect commits — but it removes inconsistent state from the cluster within a bounded window.

The pg_cluster_fence_state view exposes per-backend signal counters, broadcast timestamps, and the self-fence-pending state. The behaviour is governed by four PGC_POSTMASTER GUCs: cluster.self_fence_enabled / cluster.self_fence_grace_ms / cluster.freeze_writes_enabled / cluster.fence_audit_log (see configuration.md for full reference).

Status. The fence-lite path ships with the catalog surface, unit coverage for the signal/self-fence contracts, and 2-node steady-state TAP coverage. End-to-end freeze/thaw broadcast TAP is deferred until the voting-disk I/O failure injection harness exists. External fence command integration (IPMI / STONITH / cloud API) is not in scope and will land in a future watchdog spec.

Reconfig coordinator

When the cluster sub-system service cssd (cluster synchronization service) detects that a peer has stopped sending heartbeats for longer than the deadband window, the surviving nodes run a deterministic reconfig coordinator:

Every surviving node's lmon (lock monitor) reads the current peer liveness state from cssd once per tick.
The set of declared peers (excluding the dead set and any local node that has lost quorum) is the survivor set.
The lowest node_id in the survivor set is the coordinator. Every surviving node computes the same coordinator independently — there is no leader election round-trip.
Every surviving in-quorum node broadcasts PROCSIG_CLUSTER_RECONFIG_START to its own local backends. Backends notice this at the next CHECK_FOR_INTERRUPTS and abort any in-flight writable transaction with SQLSTATE 53R60 (ERRCODE_CLUSTER_RECONFIG_IN_PROGRESS). Read-only transactions absorb silently.
Only the coordinator advances the cluster membership epoch. The new epoch is carried on the next outbound cluster_ic envelope from each node;peers observe and converge via Lamport piggyback.
pg_cluster_reconfig_state records the applied event: event id, coordinator node id, old/new epoch, dead bitmap, applied timestamp, observer role (coordinator / survivor / none) and a snapshot of the cssd dead-generation counter that disambiguates a rejoin-then- redeath from a single sustained outage.

peer dies
  |
  +- cssd deadband on every surviving node
  |     |
  |     +- pg_cluster_cssd_peers shows peer state = suspected → dead
  |
  +- lmon tick on every in-quorum survivor
        |
        +- compute survivor set + coordinator (deterministic)
        |
        +- broadcast PROCSIG_CLUSTER_RECONFIG_START to local backends
        |     |
        |     +- backend ProcessInterrupts:
        |           writable tx → ERROR (53R60) → rollback + retry safe
        |           read-only / idle → absorb
        |
        +- if self == coordinator: epoch ++ + record event
        +- else                 : record observer event (no epoch++)

The retry path is deliberately fast. Clients that observe 53R60 should retry the transaction immediately;the next attempt runs under the new membership epoch. The coordinator broadcast and per-node PROCSIG delivery typically complete in a single LMON tick (default 1000 ms), so the visible downtime for a writable client is bounded by the cssd deadband plus one tick.

Status. Reconfig coordinator A-scope (internal-only) ships with a single-tick deterministic-coordinator design, no phase machine, no voting-disk persistence of reconfig events, and no peer-fence actor. The fail-closed authority remains cluster_qvotec_in_quorum() + the spec-2.28 freeze gate + the commit gate;the reconfig coordinator only advances the epoch and wakes backends — it does not itself prevent a write. Larger coordination flows (peer-fence broadcast, phase machine, voting-disk event log, dynamic election) are future spec work.

Postmaster startup flow

postmaster main
  |
  +- ProcessConfigFile (postgresql.conf)
  |     |
  |     +- cluster.* GUCs registered via cluster_init_guc
  |        (process_shared_preload_libraries phase)
  |
  +- CreateSharedMemoryAndSemaphores
        |
        +- cluster_init_shmem
              |
              +- cluster_ctl_shmem_init      (allocate cluster control block)
              +- cluster_conf_shmem_init     (allocate topology shmem)
              +- cluster_conf_load           (parse pgrac.conf or fall back single-node)
              +- cluster_ic_init             (bind interconnect vtable based on tier GUC)

Build modes

Build flag	Behavior
`--enable-cluster`	Cluster subsystem active. All cluster GUCs / views / SRFs present. `cluster_*.o` objects linked into `postgres`.
`--disable-cluster` (default)	Cluster subsystem absent. Cluster GUCs not recognised. Cluster system views return zero rows. `postgres` binary is symbol-equivalent to upstream PostgreSQL 16.13 (no `cluster_init` / `cluster_shutdown` / `pgrac_version_string` symbols).

The --disable-cluster mode is useful when running linkdb as a drop-in PostgreSQL replacement without enabling cluster features — e.g. on a standalone development machine.

Interconnect tiers

The cluster_ic module defines four tier slots. The first two are populated in the current release.

Tier	Status
`stub`	Active — same-node ops are no-ops, cross-node ops raise `ERRCODE_FEATURE_NOT_SUPPORTED`
`tier1`	Active — TCP transport carrying the LMON aux-process heartbeat between every pair of nodes declared in `pgrac.conf`. Per-peer state is exposed via `pg_cluster_ic_peers`.
`tier2`	Not supported
`tier3`	Not supported

Switching tiers is controlled via the cluster.interconnect_tier GUC. See Configuration for runtime details.

Storage manager dispatch

linkdb extends PostgreSQL's smgrsw[] array from one to two entries when built with --enable-cluster. The first entry is PostgreSQL's stock md.c;the second is cluster_smgr, a cluster-aware storage manager that bridges into the cluster_shared_fs vtable.

smgr_which	Owner	Used when
`0`	`md.c` (stock PG)	Temporary relations always; permanent relations whenever `cluster.smgr_user_relations = off` (the default) or `cluster.shared_storage_backend = stub`.
`1`	`cluster_smgr`	Permanent relations only, and only when both `cluster.shared_storage_backend != stub` AND `cluster.smgr_user_relations = on`. Two cluster instances may have a backend each open the same relation concurrently;each instance keeps process-local SMgrRelation state and process-local file handles.

Cache invalidation across cluster instances is wired through three hook points (relation, relation map, and unlink-pending) that fire from the corresponding PostgreSQL invalidation paths. The hook bodies bump a single counter — visible via pg_stat_cluster_counters under cluster.smgr.remote_invalidation_stub_call_count — and the unlink-pending hook additionally closes the per-process file handle for the unlinked relation. Cross-instance signal propagation is not yet activated;peer instances may briefly observe stale cached state until invalidation completes through PostgreSQL's regular sinval queue.

The cluster.smgr_user_relations GUC is experimental; enabling it raises a postmaster startup WARNING and is unsuitable for production workloads. See Configuration.

Message envelope

Every message that crosses the interconnect — regardless of which tier carries it — is framed by a fixed 36-byte little-endian header (the cluster interconnect envelope) followed by a per-message-type payload. The envelope carries the wire-protocol version, message type identifier, sender / destination node IDs, payload length, and a CRC covering the whole frame. All multi-byte fields are 4-byte aligned.

Each message type is registered once at postmaster startup with a stable numeric identifier (msg_type), a name, and metadata describing which backend types are allowed to produce it and whether it may be broadcast. The runtime catalogue is exposed via pg_cluster_ic_msg_types. The current release registers one type:

`msg_type`	Name	Producer	Notes
`1`	`heartbeat`	LMON aux process	Point-to-point liveness probe; no payload other than the envelope itself.

Adding a new interconnect message type therefore means: pick the next free msg_type, register it during postmaster startup with its allowed producer set + handler, and define the payload layout. The wire envelope itself stays unchanged.

Membership epoch and SCN piggyback

Every envelope carries two cross-cutting fields beyond its message identification:

Membership epoch is a node-local monotonic counter incremented whenever the cluster's membership changes (a peer joins or leaves). The receiver rejects any envelope whose carried epoch does not match the local current epoch — pre-reconfig in-flight frames are dropped on the recv path with a logged message and a per-peer stale_epoch_drop_count increment. No frame is acted upon during a membership transition, so post-reconfig state is never contaminated by stale traffic.
SCN piggyback (Lamport-style) lets every received envelope advance the local SCN clock if the carried SCN is greater than the current local value. Verification happens before the advance, so forged or stale-epoch frames cannot push the SCN forward.

Large payloads and framing

The 36-byte envelope caps a single message frame at 16 MB of payload. When a sender needs to deliver a larger payload it uses the chunked framing layer, which splits the buffer into N consecutive frames each carrying a 16-byte chunk header that records chunk_seq, chunk_total, the original total_payload_len, and the wrapped inner msg_type. The receiver assembles the chunks into a per-peer dedicated memory context; on completion the inner message is dispatched as if it had arrived in a single frame. If a peer falls silent partway through a reassembly, the receiver times out (10 s by default), logs the event, and drops the connection rather than holding the half-built buffer. The default upper bound on a chunked payload is 64 MB, configurable up to a 256 MB hard cap via cluster.interconnect_payload_max_bytes.

Liveness — application heartbeat plus TCP keepalive

Liveness is enforced on two layers. The primary path is an application-level heartbeat that LMON sends at a configurable interval (default 1 s). Three missed acks mark the peer down and trigger reconnection — typically within ~3 s of a real failure. Beneath that the kernel applies TCP keepalive probes, configured via cluster.interconnect_tcp_keepidle_sec / _tcp_keepintvl_sec / _tcp_keepcnt. This is a fallback for the rare case where the application path stalls but the socket stays open; with the defaults the kernel needs at most 120 s to declare the connection dead.

Topology

The cluster topology — list of nodes plus their addresses — is declared in a single file (pgrac.conf) loaded at postmaster startup. The parsed result lives in shared memory so every backend sees an identical view. Updates require a restart in the current release.

See Configuration for the pgrac.conf format and System views for the pg_cluster_nodes runtime projection.

What's not in the picture yet

Several subsystems referenced in the wait event registry (see Wait events) are scaffolded but not active in the current release. Operations they would serve return ERRCODE_FEATURE_NOT_SUPPORTED rather than blocking on a wait.

Subsystem	Status
GES (distributed lock manager)	Not active
PCM (parallel cache management)	Not active
Cache Fusion (cross-node block transfer)	Not active
SCN propagation	Not active
Reconfiguration / Heartbeat	Not active
Cluster-aware recovery	Not active
Cross-node sinval	Not active
Cross-node undo	Not active
Active Data Guard	Not active

Reporting issues

File issues at https://github.com/sqlrush/linkdb/issues.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture overview

High-level component map

Modules at a glance

Quorum coordination

In-flight transaction management on quorum loss

Reconfig coordinator

Postmaster startup flow

Build modes

Interconnect tiers

Storage manager dispatch

Message envelope

Membership epoch and SCN piggyback

Large payloads and framing

Liveness — application heartbeat plus TCP keepalive

Topology

What's not in the picture yet

Reporting issues

FilesExpand file tree

overview.md

Latest commit

History

overview.md

File metadata and controls

Architecture overview

High-level component map

Modules at a glance

Quorum coordination

In-flight transaction management on quorum loss

Reconfig coordinator

Postmaster startup flow

Build modes

Interconnect tiers

Storage manager dispatch

Message envelope

Membership epoch and SCN piggyback

Large payloads and framing

Liveness — application heartbeat plus TCP keepalive

Topology

What's not in the picture yet

Reporting issues