| 1 | +# pgraft System Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +pgraft implements a distributed consensus system using the Raft algorithm integrated with PostgreSQL. This document describes the overall system architecture, component interactions, and operational flows. |
| 6 | + |
| 7 | +## System Components |
| 8 | + |
| 9 | +### 1. PostgreSQL Cluster Nodes |
| 10 | +Each PostgreSQL instance in the cluster runs the pgraft extension and participates in the Raft consensus protocol. |
| 11 | + |
| 12 | +### 2. Raft Consensus Layer |
| 13 | +The core consensus engine is implemented in Go and provides: |
| 14 | +- Leader election |
| 15 | +- Log replication |
| 16 | +- Cluster membership management |
| 17 | +- Failure detection and recovery |
| 18 | + |
| 19 | +### 3. Network Communication |
| 20 | +Cluster nodes communicate peer-to-peer over TCP for: |
| 21 | +- Raft protocol messages |
| 22 | +- Heartbeat signals |
| 23 | +- Log replication |
| 24 | +- Configuration changes |
| 25 | + |
| 26 | +### 4. Shared Memory Interface |
| 27 | +PostgreSQL shared memory is used for: |
| 28 | +- Command queue between SQL and background worker |
| 29 | +- Cluster state persistence |
| 30 | +- Worker status tracking |
| 31 | +- Command status monitoring |
| 32 | + |
| 33 | +## High-Level Architecture |
| 34 | + |
| 35 | +``` |
| 36 | +┌─────────────────────────────────────────────────────────────────┐ |
| 37 | +│                       PostgreSQL Cluster                        │ |
| 38 | +├─────────────────┬─────────────────┬─────────────────────────────┤ |
| 39 | +│     Node 1      │     Node 2      │     Node 3                  │ |
| 40 | +│                 │                 │                             │ |
| 41 | +│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐             │ |
| 42 | +│ │ PostgreSQL  │ │ │ PostgreSQL  │ │ │ PostgreSQL  │             │ |
| 43 | +│ │   Server    │ │ │   Server    │ │ │   Server    │             │ |
| 44 | +│ └─────────────┘ │ └─────────────┘ │ └─────────────┘             │ |
| 45 | +│         │       │         │       │         │                   │ |
| 46 | +│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │ |
| 47 | +│ │    pgraft     │     pgraft      │     pgraft      │           │ |
| 48 | +│ │   Extension   │    Extension    │    Extension    │           │ |
| 49 | +│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │ |
| 50 | +│         │       │         │       │         │                   │ |
| 51 | +│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │ |
| 52 | +│ │  Background   │   Background    │   Background    │           │ |
| 53 | +│ │    Worker     │     Worker      │     Worker      │           │ |
| 54 | +│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │ |
| 55 | +│         │       │         │       │         │                   │ |
| 56 | +│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │ |
| 57 | +│ │    Go Raft    │     Go Raft     │     Go Raft     │           │ |
| 58 | +│ │    Library    │     Library     │     Library     │           │ |
| 59 | +│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │ |
| 60 | +│         │       │         │       │         │                   │ |
| 61 | +└─────────┼───────┼─────────┼───────┼─────────┼───────────────────┘ |
| 62 | +          │       │         │       │         │ |
| 63 | +          └───────┼─────────┼───────┼─────────┘ |
| 64 | +                  │         │       │ |
| 65 | +          ┌───────▼─────────▼───────▼───────┐ |
| 66 | +          │          Network Layer          │ |
| 67 | +          │    (TCP Peer Communication)     │ |
| 68 | +          └─────────────────────────────────┘ |
| 69 | +``` |
| 70 | + |
| 71 | +## Component Interaction Flow |
| 72 | + |
| 73 | +### 1. Cluster Initialization |
| 74 | + |
| 75 | +```mermaid |
| 76 | +sequenceDiagram |
| 77 | + participant U as User |
| 78 | + participant N1 as Node 1 |
| 79 | + participant N2 as Node 2 |
| 80 | + participant N3 as Node 3 |
| 81 | + |
| 82 | + U->>N1: SELECT pgraft_init() |
| 83 | + N1->>N1: Start background worker |
| 84 | + N1->>N1: Initialize Raft node |
| 85 | + N1->>N1: Start network server |
| 86 | + |
| 87 | + U->>N1: SELECT pgraft_add_node('node2:5433') |
| 88 | + N1->>N2: Connect to node 2 |
| 89 | + N2->>N2: Start background worker |
| 90 | + N2->>N2: Join cluster |
| 91 | + |
| 92 | + U->>N1: SELECT pgraft_add_node('node3:5433') |
| 93 | + N1->>N3: Connect to node 3 |
| 94 | + N3->>N3: Start background worker |
| 95 | + N3->>N3: Join cluster |
| 96 | + |
| 97 | + Note over N1,N3: Cluster formed with 3 nodes |
| 98 | +``` |
| 99 | + |
| 100 | +### 2. Leader Election Process |
| 101 | + |
| 102 | +```mermaid |
| 103 | +sequenceDiagram |
| 104 | + participant N1 as Node 1 (Leader) |
| 105 | + participant N2 as Node 2 (Follower) |
| 106 | + participant N3 as Node 3 (Follower) |
| 107 | + |
| 108 | + loop Heartbeat |
| 109 | + N1->>N2: AppendEntries (heartbeat) |
| 110 | + N1->>N3: AppendEntries (heartbeat) |
| 111 | + N2->>N1: AppendEntries Response |
| 112 | + N3->>N1: AppendEntries Response |
| 113 | + end |
| 114 | + |
| 115 | + Note over N1: Leader fails |
| 116 | + N2->>N2: Election timeout (increment term, vote for self) |
| 117 | + N2->>N3: RequestVote |
| 118 | + N3->>N2: Vote granted |
| 119 | + N2->>N2: Become leader |
| 120 | + N2->>N3: AppendEntries (heartbeat) |
| 121 | +``` |
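
The `RequestVote` exchange in the diagram can be illustrated from the voter's side. The following is a minimal sketch of the vote-granting rule from the Raft paper, not pgraft's Go implementation; the names `VoterState` and `handle_request_vote` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoterState:
    """Per-node election state (hypothetical names, for illustration only)."""
    current_term: int
    voted_for: Optional[int]  # node ID this node voted for in current_term
    last_log_index: int
    last_log_term: int


def handle_request_vote(state: VoterState, term: int, candidate_id: int,
                        last_log_index: int, last_log_term: int) -> bool:
    """Return True if the vote is granted, following the Raft paper's rule."""
    # Reject candidates from an older term.
    if term < state.current_term:
        return False
    # A newer term resets any vote cast in the previous term.
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None
    # The candidate's log must be at least as up to date as the voter's:
    # a higher last term wins; with equal terms, the longer log wins.
    log_ok = (last_log_term, last_log_index) >= (
        state.last_log_term, state.last_log_index)
    if log_ok and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id
        return True
    return False
```

In the three-node diagram, Node 2 votes for itself and needs only Node 3's vote to reach a majority of two.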
| 122 | + |
| 123 | +### 3. Log Replication |
| 124 | + |
| 125 | +```mermaid |
| 126 | +sequenceDiagram |
| 127 | + participant U as User |
| 128 | + participant L as Leader |
| 129 | + participant F1 as Follower 1 |
| 130 | + participant F2 as Follower 2 |
| 131 | + |
| 132 | + U->>L: INSERT/UPDATE/DELETE |
| 133 | + L->>L: Append to log |
| 134 | + L->>F1: AppendEntries (log entry) |
| 135 | + L->>F2: AppendEntries (log entry) |
| 136 | + F1->>L: AppendEntries Response |
| 137 | + F2->>L: AppendEntries Response |
| 138 | + L->>L: Commit entry |
| 139 | + L->>F1: AppendEntries (commit) |
| 140 | + L->>F2: AppendEntries (commit) |
| 141 | + L->>U: Transaction committed |
| 142 | +``` |
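
The "Commit entry" step depends on how many nodes have acknowledged an entry. As a rough sketch of the standard Raft rule (not taken from pgraft's source), the leader can advance its commit index to the highest log index stored on a majority, provided that entry belongs to its current term. The function name and argument layout below are assumptions made for the example.

```python
def advance_commit_index(commit_index: int, current_term: int,
                         log_terms: list, match_index: dict) -> int:
    """Return the new commit index for a leader.

    log_terms[i] is the term of the entry at 1-based log index i + 1.
    match_index maps follower ID -> highest index known to be replicated there.
    """
    leader_last = len(log_terms)
    # Highest replicated index on every node, leader included, largest first.
    acked = sorted([leader_last, *match_index.values()], reverse=True)
    majority = len(acked) // 2 + 1
    # The majority-th largest value is stored on at least `majority` nodes.
    candidate = acked[majority - 1]
    # Raft only commits entries from the leader's own term by counting replicas.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```

With two followers, a single acknowledgement is enough: the leader plus one follower already form a majority of three.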
| 143 | + |
| 144 | +## Data Flow Architecture |
| 145 | + |
| 146 | +### 1. Command Processing Flow |
| 147 | + |
| 148 | +``` |
| 149 | +SQL Function → Command Queue → Background Worker → Go Raft Library → Network |
| 150 | +     ↑                                                                   ↓ |
| 151 | +     └─────────── Command Status ← Shared Memory ← Raft State ←──────────┘ |
| 152 | +``` |
| 153 | + |
| 154 | +### 2. Shared Memory Layout |
| 155 | + |
| 156 | +``` |
| 157 | +┌─────────────────────────────────────────────────────────────┐ |
| 158 | +│                        Shared Memory                        │ |
| 159 | +├─────────────────────────────────────────────────────────────┤ |
| 160 | +│ Worker State                                                │ |
| 161 | +│ ├─ Status (IDLE/INIT/RUNNING/STOPPING/STOPPED)              │ |
| 162 | +│ ├─ Node ID, Address, Port                                   │ |
| 163 | +│ ├─ Cluster Name                                             │ |
| 164 | +│ └─ Initialization Flags                                     │ |
| 165 | +├─────────────────────────────────────────────────────────────┤ |
| 166 | +│ Command Queue (Circular Buffer)                             │ |
| 167 | +│ ├─ Command Type (INIT/ADD_NODE/REMOVE_NODE/LOG_APPEND)      │ |
| 168 | +│ ├─ Command Data                                             │ |
| 169 | +│ ├─ Timestamp                                                │ |
| 170 | +│ └─ Queue Head/Tail Pointers                                 │ |
| 171 | +├─────────────────────────────────────────────────────────────┤ |
| 172 | +│ Command Status FIFO                                         │ |
| 173 | +│ ├─ Command ID                                               │ |
| 174 | +│ ├─ Status (PENDING/PROCESSING/COMPLETED/FAILED)             │ |
| 175 | +│ ├─ Error Message                                            │ |
| 176 | +│ └─ Completion Time                                          │ |
| 177 | +├─────────────────────────────────────────────────────────────┤ |
| 178 | +│ Cluster State                                               │ |
| 179 | +│ ├─ Current Leader ID                                        │ |
| 180 | +│ ├─ Current Term                                             │ |
| 181 | +│ ├─ Node Membership                                          │ |
| 182 | +│ └─ Log Statistics                                           │ |
| 183 | +└─────────────────────────────────────────────────────────────┘ |
| 184 | +``` |
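
To make the command queue and status tracking above more concrete, here is a small single-process model of a circular buffer with head/tail pointers and a status table. It is only a sketch: the real structures would live in PostgreSQL shared memory (in C, with locking), and the class and method names here are hypothetical. The enum values mirror the ones listed in the layout.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CommandType(Enum):
    INIT = "INIT"
    ADD_NODE = "ADD_NODE"
    REMOVE_NODE = "REMOVE_NODE"
    LOG_APPEND = "LOG_APPEND"


class CommandStatus(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Command:
    command_id: int
    command_type: CommandType
    data: str
    timestamp: float


class CommandQueue:
    """Fixed-capacity ring buffer plus a per-command status table."""

    def __init__(self, capacity: int) -> None:
        self.slots = [None] * capacity
        self.head = 0      # next slot the worker reads
        self.tail = 0      # next slot a SQL function writes
        self.count = 0
        self.next_id = 1
        self.status = {}   # command_id -> CommandStatus
        self.errors = {}   # command_id -> error message

    def enqueue(self, command_type: CommandType, data: str) -> Optional[int]:
        """SQL side: queue a command; returns its ID, or None if the queue is full."""
        if self.count == len(self.slots):
            return None
        cmd = Command(self.next_id, command_type, data, time.time())
        self.slots[self.tail] = cmd
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        self.count += 1
        self.next_id += 1
        self.status[cmd.command_id] = CommandStatus.PENDING
        return cmd.command_id

    def dequeue(self) -> Optional[Command]:
        """Worker side: take the oldest command and mark it PROCESSING."""
        if self.count == 0:
            return None
        cmd = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        self.status[cmd.command_id] = CommandStatus.PROCESSING
        return cmd

    def finish(self, command_id: int, error: Optional[str] = None) -> None:
        """Worker side: record the outcome so the SQL side can poll it."""
        if error is None:
            self.status[command_id] = CommandStatus.COMPLETED
        else:
            self.status[command_id] = CommandStatus.FAILED
            self.errors[command_id] = error
```

A caller would enqueue a command, keep the returned ID, and poll `status` until it reaches `COMPLETED` or `FAILED`, which matches the feedback loop in the command processing flow.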
| 185 | + |
| 186 | +## Network Architecture |
| 187 | + |
| 188 | +### 1. Peer-to-Peer Communication |
| 189 | + |
| 190 | +``` |
| 191 | +┌─────────────┐    TCP    ┌─────────────┐    TCP    ┌─────────────┐ |
| 192 | +│   Node 1    │◄─────────►│   Node 2    │◄─────────►│   Node 3    │ |
| 193 | +│             │           │             │           │             │ |
| 194 | +│ Port: 5433  │           │ Port: 5434  │           │ Port: 5435  │ |
| 195 | +│ Raft Port:  │           │ Raft Port:  │           │ Raft Port:  │ |
| 196 | +│ 8001        │           │ 8002        │           │ 8003        │ |
| 197 | +└─────────────┘           └─────────────┘           └─────────────┘ |
| 198 | +``` |
| 199 | + |
| 200 | +### 2. Message Types |
| 201 | + |
| 202 | +- **RequestVote**: Candidate requesting votes during elections |
| 203 | +- **AppendEntries**: Leader sending log entries and heartbeats |
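| 204 | +- **InstallSnapshot**: Leader sending a snapshot to bring lagging followers up to date |
| 205 | +- **Heartbeat**: Periodic leader-to-follower message (shown in the diagrams above as an AppendEntries with no log entries) that keeps followers from starting an election |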
| 206 | + |
| 207 | +## Failure Scenarios and Recovery |
| 208 | + |
| 209 | +### 1. Leader Failure |
| 210 | + |
| 211 | +``` |
| 212 | +Normal Operation → Leader Fails → Election Timeout → New Election → New Leader |
| 213 | +``` |
| 214 | + |
| 215 | +### 2. Network Partition |
| 216 | + |
| 217 | +``` |
| 218 | +Full Connectivity → Network Split |
| 219 | +  ├→ Partition A (majority) → Continues Operation |
| 220 | +  └→ Partition B (minority) → Stops Accepting Writes (no quorum) |
| 221 | +``` |
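
The majority/minority distinction comes down to simple quorum arithmetic. The helper names in this sketch are hypothetical; the formulas are the usual majority-quorum rules that Raft relies on.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def can_accept_writes(partition_size: int, cluster_size: int) -> bool:
    """A partition can elect a leader and commit writes only with a majority."""
    return partition_size >= quorum(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return cluster_size - quorum(cluster_size)
```

For a three-node cluster split 2/1, only the two-node side keeps operating. A four-node cluster split 2/2 leaves neither side with a quorum, which is one reason odd cluster sizes are usually preferred.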
| 222 | + |
| 223 | +### 3. Node Recovery |
| 224 | + |
| 225 | +``` |
| 226 | +Node Down → Node Restarts → Joins Cluster → Catches Up Log → Active Participant |
| 227 | +``` |
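
The "Catches Up Log" step relies on the follower-side `AppendEntries` consistency check. The sketch below follows the rule described in the Raft paper using a plain Python list as the log; it is an illustration under that assumption, not pgraft's actual code, and the function name is hypothetical.

```python
def append_entries(log: list, prev_index: int, prev_term: int,
                   entries: list) -> bool:
    """Apply a leader's entries to a follower log.

    log and entries hold (term, command) tuples; log indices are 1-based,
    so the entry at index i is log[i - 1]. Returns False when the follower
    lacks the entry at prev_index or its term differs, which tells the
    leader to retry from an earlier index.
    """
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based position of this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                # Conflicting entry: drop it and everything after it.
                del log[pos:]
                log.append(entry)
        else:
            log.append(entry)
    return True
```

A restarted node that is far behind first rejects the leader's request; the leader then retries from earlier indices until the check passes and the missing entries are appended.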
| 228 | + |
| 229 | +## Security Considerations |
| 230 | + |
| 231 | +### 1. Network Security |
| 232 | +- TCP connections between peers |
| 233 | +- Configurable IP addresses and ports |
| 234 | +- No built-in encryption (relies on network-level security) |
| 235 | + |
| 236 | +### 2. Access Control |
| 237 | +- PostgreSQL's native authentication |
| 238 | +- Extension functions require appropriate privileges |
| 239 | +- Shared memory access controlled by PostgreSQL |
| 240 | + |
| 241 | +## Performance Characteristics |
| 242 | + |
| 243 | +### 1. Latency |
| 244 | +- Leader election: ~1-5 seconds (configurable) |
| 245 | +- Log replication: Network RTT + disk I/O |
| 246 | +- Heartbeat interval: 1 second (configurable) |
| 247 | + |
| 248 | +### 2. Throughput |
| 249 | +- Single leader handles all writes |
| 250 | +- Followers can serve read-only queries |
| 251 | +- Log replication limited by network bandwidth |
| 252 | + |
| 253 | +### 3. Scalability |
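| 254 | +- Works best with 3-5 nodes (an odd node count gives the most fault tolerance per node) |
| 255 | +- More nodes increase election time |
| 256 | +- Network partitions affect availability: only a partition that holds a majority keeps accepting writes |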
| 257 | + |
| 258 | +## Configuration Parameters |
| 259 | + |
| 260 | +### 1. Network Settings |
| 261 | +- `pgraft.listen_address`: IP address to bind |
| 262 | +- `pgraft.listen_port`: Port for Raft communication |
| 263 | +- `pgraft.peer_timeout`: Network timeout for peer connections |
| 264 | + |
| 265 | +### 2. Raft Parameters |
| 266 | +- `pgraft.heartbeat_interval`: Heartbeat frequency (ms) |
| 267 | +- `pgraft.election_timeout`: Election timeout range (ms) |
| 268 | +- `pgraft.max_log_entries`: Maximum log entries per batch |
| 269 | + |
| 270 | +### 3. Operational Settings |
| 271 | +- `pgraft.cluster_name`: Unique cluster identifier |
| 272 | +- `pgraft.debug_enabled`: Enable debug logging |
| 273 | +- `pgraft.health_period_ms`: Health check frequency |
| 274 | + |
| 275 | +## Monitoring and Observability |
| 276 | + |
| 277 | +### 1. Cluster Health |
| 278 | +- Leader election status |
| 279 | +- Node membership status |
| 280 | +- Network connectivity |
| 281 | +- Log replication lag |
| 282 | + |
| 283 | +### 2. Performance Metrics |
| 284 | +- Command processing latency |
| 285 | +- Network message rates |
| 286 | +- Memory usage |
| 287 | +- Background worker status |
| 288 | + |
| 289 | +### 3. Logging |
| 290 | +- Raft protocol events |
| 291 | +- Network communication |
| 292 | +- Error conditions |
| 293 | +- Performance statistics |
| 294 | + |
| 295 | +## Deployment Considerations |
| 296 | + |
| 297 | +### 1. Hardware Requirements |
| 298 | +- Sufficient RAM for shared memory |
| 299 | +- Network bandwidth for replication |
| 300 | +- Disk I/O for log persistence |
| 301 | +- CPU for consensus processing |
| 302 | + |
| 303 | +### 2. Network Requirements |
| 304 | +- Low-latency network between nodes |
| 305 | +- Reliable network connectivity |
| 306 | +- Sufficient bandwidth for replication |
| 307 | +- Firewall configuration for peer ports |
| 308 | + |
| 309 | +### 3. PostgreSQL Configuration |
| 310 | +- Shared memory allocation |
| 311 | +- Background worker limits |
| 312 | +- Connection limits |
| 313 | +- Logging configuration |
| 314 | + |
| 315 | +This architecture provides a robust foundation for distributed PostgreSQL clusters with automatic failover, consistent replication, and high availability. |