Skip to content

Commit be10315

Browse files
Issue (#2): Refector Code.
1 parent c3eeb1e commit be10315

10 files changed

Lines changed: 1882 additions & 556 deletions

File tree

pgraft/ARCHITECTURE.md

Lines changed: 315 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,315 @@
1+
# pgraft System Architecture
2+
3+
## Overview
4+
5+
pgraft implements a distributed consensus system using the Raft algorithm integrated with PostgreSQL. This document describes the overall system architecture, component interactions, and operational flows.
6+
7+
## System Components
8+
9+
### 1. PostgreSQL Cluster Nodes
10+
Each PostgreSQL instance in the cluster runs the pgraft extension and participates in the Raft consensus protocol.
11+
12+
### 2. Raft Consensus Layer
13+
The core consensus engine implemented in Go, providing:
14+
- Leader election
15+
- Log replication
16+
- Cluster membership management
17+
- Failure detection and recovery
18+
19+
### 3. Network Communication
20+
TCP-based peer-to-peer communication between cluster nodes for:
21+
- Raft protocol messages
22+
- Heartbeat signals
23+
- Log replication
24+
- Configuration changes
25+
26+
### 4. Shared Memory Interface
27+
PostgreSQL shared memory used for:
28+
- Command queue between SQL and background worker
29+
- Cluster state persistence
30+
- Worker status tracking
31+
- Command status monitoring
32+
33+
## High-Level Architecture
34+
35+
```
36+
┌─────────────────────────────────────────────────────────────────┐
37+
│ PostgreSQL Cluster │
38+
├─────────────────┬─────────────────┬─────────────────────────────┤
39+
│ Node 1 │ Node 2 │ Node 3 │
40+
│ │ │ │
41+
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐ │
42+
│ │ PostgreSQL │ │ │ PostgreSQL │ │ │ PostgreSQL │ │
43+
│ │ Server │ │ │ Server │ │ │ Server │ │
44+
│ └─────────────┘ │ └─────────────┘ │ └─────────────┘ │
45+
│ │ │ │ │ │ │
46+
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
47+
│ │ pgraft │ │ pgraft │ │ pgraft │ │
48+
│ │ Extension │ │ Extension │ │ Extension │ │
49+
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
50+
│ │ │ │ │ │ │
51+
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
52+
│ │ Background │ │ Background │ │ Background │ │
53+
│ │ Worker │ │ Worker │ │ Worker │ │
54+
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
55+
│ │ │ │ │ │ │
56+
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
57+
│ │ Go Raft │ │ Go Raft │ │ Go Raft │ │
58+
│ │ Library │ │ Library │ │ Library │ │
59+
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
60+
│ │ │ │ │ │ │
61+
└─────────┼───────┼─────────┼───────┼─────────┼───────────────────┘
62+
│ │ │ │ │
63+
└───────┼─────────┼───────┼─────────┘
64+
│ │ │
65+
┌───────▼─────────▼───────▼───────┐
66+
│ Network Layer │
67+
│ (TCP Peer Communication) │
68+
└─────────────────────────────────┘
69+
```
70+
71+
## Component Interaction Flow
72+
73+
### 1. Cluster Initialization
74+
75+
```mermaid
76+
sequenceDiagram
77+
participant U as User
78+
participant N1 as Node 1
79+
participant N2 as Node 2
80+
participant N3 as Node 3
81+
82+
U->>N1: SELECT pgraft_init()
83+
N1->>N1: Start background worker
84+
N1->>N1: Initialize Raft node
85+
N1->>N1: Start network server
86+
87+
U->>N2: SELECT pgraft_add_node('node2:5433')
88+
N1->>N2: Connect to node 2
89+
N2->>N2: Start background worker
90+
N2->>N2: Join cluster
91+
92+
U->>N3: SELECT pgraft_add_node('node3:5433')
93+
N1->>N3: Connect to node 3
94+
N3->>N3: Start background worker
95+
N3->>N3: Join cluster
96+
97+
Note over N1,N3: Cluster formed with 3 nodes
98+
```
99+
100+
### 2. Leader Election Process
101+
102+
```mermaid
103+
sequenceDiagram
104+
participant N1 as Node 1 (Leader)
105+
participant N2 as Node 2 (Follower)
106+
participant N3 as Node 3 (Follower)
107+
108+
loop Heartbeat
109+
N1->>N2: AppendEntries (heartbeat)
110+
N1->>N3: AppendEntries (heartbeat)
111+
N2->>N1: AppendEntries Response
112+
N3->>N1: AppendEntries Response
113+
end
114+
115+
Note over N1: Leader fails
116+
N2->>N2: Election timeout
117+
N2->>N3: RequestVote
118+
N3->>N2: Vote granted
119+
N2->>N2: Become leader
120+
N2->>N3: AppendEntries (heartbeat)
121+
```
122+
123+
### 3. Log Replication
124+
125+
```mermaid
126+
sequenceDiagram
127+
participant U as User
128+
participant L as Leader
129+
participant F1 as Follower 1
130+
participant F2 as Follower 2
131+
132+
U->>L: INSERT/UPDATE/DELETE
133+
L->>L: Append to log
134+
L->>F1: AppendEntries (log entry)
135+
L->>F2: AppendEntries (log entry)
136+
F1->>L: AppendEntries Response
137+
F2->>L: AppendEntries Response
138+
L->>L: Commit entry
139+
L->>F1: AppendEntries (commit)
140+
L->>F2: AppendEntries (commit)
141+
L->>U: Transaction committed
142+
```
143+
144+
## Data Flow Architecture
145+
146+
### 1. Command Processing Flow
147+
148+
```
149+
SQL Function → Command Queue → Background Worker → Go Raft Library → Network
150+
↑ ↓
151+
└─────────── Command Status ← Shared Memory ← Raft State ←──────────┘
152+
```
153+
154+
### 2. Shared Memory Layout
155+
156+
```
157+
┌─────────────────────────────────────────────────────────────┐
158+
│ Shared Memory │
159+
├─────────────────────────────────────────────────────────────┤
160+
│ Worker State │
161+
│ ├─ Status (IDLE/INIT/RUNNING/STOPPING/STOPPED) │
162+
│ ├─ Node ID, Address, Port │
163+
│ ├─ Cluster Name │
164+
│ └─ Initialization Flags │
165+
├─────────────────────────────────────────────────────────────┤
166+
│ Command Queue (Circular Buffer) │
167+
│ ├─ Command Type (INIT/ADD_NODE/REMOVE_NODE/LOG_APPEND) │
168+
│ ├─ Command Data │
169+
│ ├─ Timestamp │
170+
│ └─ Queue Head/Tail Pointers │
171+
├─────────────────────────────────────────────────────────────┤
172+
│ Command Status FIFO │
173+
│ ├─ Command ID │
174+
│ ├─ Status (PENDING/PROCESSING/COMPLETED/FAILED) │
175+
│ ├─ Error Message │
176+
│ └─ Completion Time │
177+
├─────────────────────────────────────────────────────────────┤
178+
│ Cluster State │
179+
│ ├─ Current Leader ID │
180+
│ ├─ Current Term │
181+
│ ├─ Node Membership │
182+
│ └─ Log Statistics │
183+
└─────────────────────────────────────────────────────────────┘
184+
```
185+
186+
## Network Architecture
187+
188+
### 1. Peer-to-Peer Communication
189+
190+
```
191+
┌─────────────┐ TCP ┌─────────────┐ TCP ┌─────────────┐
192+
│ Node 1 │◄─────────►│ Node 2 │◄─────────►│ Node 3 │
193+
│ │ │ │ │ │
194+
│ Port: 5433 │ │ Port: 5434 │ │ Port: 5435 │
195+
│ Raft Port: │ │ Raft Port: │ │ Raft Port: │
196+
│ 8001 │ │ 8002 │ │ 8003 │
197+
└─────────────┘ └─────────────┘ └─────────────┘
198+
```
199+
200+
### 2. Message Types
201+
202+
- **RequestVote**: Candidate requesting votes during elections
203+
- **AppendEntries**: Leader sending log entries and heartbeats
204+
- **InstallSnapshot**: Leader sending snapshot to catch up slow followers
205+
- **Heartbeat**: Regular leader-to-follower communication
206+
207+
## Failure Scenarios and Recovery
208+
209+
### 1. Leader Failure
210+
211+
```
212+
Normal Operation → Leader Fails → Election Timeout → New Election → New Leader
213+
```
214+
215+
### 2. Network Partition
216+
217+
```
218+
Full Connectivity → Network Split → Partition A (majority) → Partition B (minority)
219+
↓ ↓
220+
Continues Operation Stops Accepting Writes
221+
```
222+
223+
### 3. Node Recovery
224+
225+
```
226+
Node Down → Node Restarts → Joins Cluster → Catches Up Log → Active Participant
227+
```
228+
229+
## Security Considerations
230+
231+
### 1. Network Security
232+
- TCP connections between peers
233+
- Configurable IP addresses and ports
234+
- No built-in encryption (relies on network-level security)
235+
236+
### 2. Access Control
237+
- PostgreSQL's native authentication
238+
- Extension functions require appropriate privileges
239+
- Shared memory access controlled by PostgreSQL
240+
241+
## Performance Characteristics
242+
243+
### 1. Latency
244+
- Leader election: ~1-5 seconds (configurable)
245+
- Log replication: Network RTT + disk I/O
246+
- Heartbeat interval: 1 second (configurable)
247+
248+
### 2. Throughput
249+
- Single leader handles all writes
250+
- Followers can serve read-only queries
251+
- Log replication limited by network bandwidth
252+
253+
### 3. Scalability
254+
- Optimal with 3-5 nodes
255+
- More nodes increase election time
256+
- Network partitions affect availability
257+
258+
## Configuration Parameters
259+
260+
### 1. Network Settings
261+
- `pgraft.listen_address`: IP address to bind
262+
- `pgraft.listen_port`: Port for Raft communication
263+
- `pgraft.peer_timeout`: Network timeout for peer connections
264+
265+
### 2. Raft Parameters
266+
- `pgraft.heartbeat_interval`: Heartbeat frequency (ms)
267+
- `pgraft.election_timeout`: Election timeout range (ms)
268+
- `pgraft.max_log_entries`: Maximum log entries per batch
269+
270+
### 3. Operational Settings
271+
- `pgraft.cluster_name`: Unique cluster identifier
272+
- `pgraft.debug_enabled`: Enable debug logging
273+
- `pgraft.health_period_ms`: Health check frequency
274+
275+
## Monitoring and Observability
276+
277+
### 1. Cluster Health
278+
- Leader election status
279+
- Node membership status
280+
- Network connectivity
281+
- Log replication lag
282+
283+
### 2. Performance Metrics
284+
- Command processing latency
285+
- Network message rates
286+
- Memory usage
287+
- Background worker status
288+
289+
### 3. Logging
290+
- Raft protocol events
291+
- Network communication
292+
- Error conditions
293+
- Performance statistics
294+
295+
## Deployment Considerations
296+
297+
### 1. Hardware Requirements
298+
- Sufficient RAM for shared memory
299+
- Network bandwidth for replication
300+
- Disk I/O for log persistence
301+
- CPU for consensus processing
302+
303+
### 2. Network Requirements
304+
- Low-latency network between nodes
305+
- Reliable network connectivity
306+
- Sufficient bandwidth for replication
307+
- Firewall configuration for peer ports
308+
309+
### 3. PostgreSQL Configuration
310+
- Shared memory allocation
311+
- Background worker limits
312+
- Connection limits
313+
- Logging configuration
314+
315+
This architecture provides a robust foundation for distributed PostgreSQL clusters with automatic failover, consistent replication, and high availability.

0 commit comments

Comments
 (0)