
Commit 409e218

auricom and claude committed
docs(ha): address PR review feedback
Critical fixes:
- Fix snapshot_threshold math: 5000 ÷ 10 = 500s ≈ 8.3 min (not 83s)
- Fix trailing_logs math: 18000 ÷ 10 = 1800s = 30 min (not 5 min)

Medium fixes:
- Fix heartbeat_timeout description: it is a follower-side election trigger, not the interval at which the leader sends heartbeats
- Add explicit restart instruction after Step 5 data copy in single-to-ha.md so the chain keeps producing blocks during preparation (Steps 6-8)
- Replace priv_validator_key.json with signer.json in single-to-ha.md to match cluster-setup.md and the E2E tests

Minor fixes:
- Exclude self from raft.peers in all examples (cluster-setup.md node-1 yaml/CLI/systemd, single-to-ha.md node-1 and node-2)
- Add "exclude local node" note to raft.peers description in overview.md
- Fix P2P port in overview.md Interaction with P2P section (7676 → 26656)
- Add text language tag to all bare fenced blocks (MD040): multiaddr example, RTT equations, and all log snippets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 9e05891 commit 409e218
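
The two corrected durations in the commit message are an entry count divided by the block rate. A minimal Python check of that arithmetic, assuming one Raft log entry per block at 10 blocks per second (the helper name `seconds_to_accumulate` is illustrative and not part of ev-node):

```python
def seconds_to_accumulate(entries: int, blocks_per_second: float) -> float:
    """Seconds needed to produce `entries` log entries at a given block rate.

    Assumes one Raft log entry per block (an assumption of this sketch).
    """
    if blocks_per_second <= 0:
        raise ValueError("blocks_per_second must be positive")
    return entries / blocks_per_second


snapshot_s = seconds_to_accumulate(5000, 10)   # snapshot_threshold: 5000
trailing_s = seconds_to_accumulate(18000, 10)  # trailing_logs: 18000

print(f"snapshot interval: {snapshot_s:.0f}s = {snapshot_s / 60:.1f} min")
print(f"trailing coverage: {trailing_s:.0f}s = {trailing_s / 60:.0f} min")
```

At a different block rate the same division applies, so the figures in overview.md should be read as specific to the 10 blocks per second example.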

3 files changed

Lines changed: 37 additions & 24 deletions


docs/guides/ha/cluster-setup.md

Lines changed: 14 additions & 8 deletions
@@ -26,7 +26,7 @@ Replace these with your actual IP addresses throughout the guide.

 P2P peers use the libp2p multiaddr format, which includes each node's peer ID:

-```
+```text
 /ip4/<ip>/tcp/<port>/p2p/<peer-id>
 ```

@@ -132,7 +132,7 @@ raft:
 node_id: "node-1"
 raft_addr: "0.0.0.0:5001"
 raft_dir: "/var/lib/ev-node/raft"
-peers: "node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
+peers: "node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"

 # Timing — tuned for RTT_MAX ≤ 25ms
 heartbeat_timeout: "92ms"
@@ -194,7 +194,7 @@ Start all five nodes as close together as possible. The order does not matter bu
 --evnode.raft.node_id="node-1" \
 --evnode.raft.raft_addr="0.0.0.0:5001" \
 --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
---evnode.raft.peers="node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
+--evnode.raft.peers="node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
 --evnode.raft.heartbeat_timeout="92ms" \
 --evnode.raft.election_timeout="368ms" \
 --evnode.raft.leader_lease_timeout="46ms" \
@@ -218,7 +218,7 @@ Adjust flags for your execution layer (e.g., remove EVM flags if you are running

 Within a few seconds of starting, you should see one node win the election:

-```
+```text
 INF raft: entering candidate state node=node-1
 INF raft: election won tally=3
 INF raft: entering leader state leader=node-1
@@ -227,7 +227,7 @@ INF block produced height=1 hash=0xabc...

 The other nodes will log:

-```
+```text
 INF raft: entering follower state leader=node-1
 INF block applied from raft log height=1 hash=0xabc...
 ```
@@ -260,8 +260,14 @@ done
 With all five nodes running and producing blocks, simulate a leader failure:

 ```bash
-# Identify the current leader from its logs, then on that machine:
-kill -SIGTERM $(pgrep evm)
+# Identify the current leader from its logs, then on that machine.
+# Preferred: use the systemd unit if ev-node runs as a service
+sudo systemctl stop ev-node
+
+# Fallback: stop the process directly (verify exactly one PID before killing)
+PID=$(pgrep -f "evm start")
+echo "Stopping PID $PID"
+kill -SIGTERM "$PID"
 ```

 Within `election_timeout` (368ms in this configuration), the remaining four nodes will elect a new leader and resume block production. Measure the actual gap in your logs:
@@ -294,7 +300,7 @@ ExecStart=/usr/local/bin/evm start \
 --evnode.raft.node_id=node-1 \
 --evnode.raft.raft_addr=0.0.0.0:5001 \
 --evnode.raft.raft_dir=/var/lib/ev-node/raft \
---evnode.raft.peers=node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001 \
+--evnode.raft.peers=node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001 \
 --evnode.raft.heartbeat_timeout=92ms \
 --evnode.raft.election_timeout=368ms \
 --evnode.raft.leader_lease_timeout=46ms \
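
The self-exclusion rule that this commit applies to every `raft.peers` example can be generated rather than hand-edited per node. A minimal Python sketch, assuming the five-node address plan used in the guide; the function name `raft_peers` and the `CLUSTER` mapping are illustrative and not part of ev-node:

```python
def raft_peers(cluster: dict[str, str], local_id: str) -> str:
    """Build the raft.peers value for `local_id`, leaving the local node out."""
    if local_id not in cluster:
        raise KeyError(f"unknown node_id: {local_id}")
    return ",".join(
        f"{node_id}@{addr}"
        for node_id, addr in cluster.items()
        if node_id != local_id  # a node must not list itself
    )


# node_id -> Raft address, following the 10.0.0.x:5001 examples in the guide.
CLUSTER = {f"node-{i}": f"10.0.0.{i}:5001" for i in range(1, 6)}

for node in CLUSTER:
    print(node, "->", raft_peers(CLUSTER, node))
```

For `node-1` this yields the same four-entry string as the corrected YAML, CLI, and systemd examples.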

docs/guides/ha/overview.md

Lines changed: 8 additions & 8 deletions
@@ -151,7 +151,7 @@ raft:
 **CLI:** `--evnode.raft.peers`
 **Default:** _(none, required)_

-A comma-separated list of **all** remote cluster members, in the format `nodeID@host:port`. The host and port must be the Raft address (`raft_addr`) of each peer as reachable from this node.
+A comma-separated list of the **other** cluster members (exclude the local node), in the format `nodeID@host:port`. The host and port must be the Raft address (`raft_addr`) of each peer as reachable from this node. Do not list the node's own `node_id` in its own `peers` field.

 Raft uses this list to:
 - Bootstrap the cluster on first start (when no persisted state exists).
@@ -182,7 +182,7 @@ raft:
 **CLI:** `--evnode.raft.heartbeat_timeout`
 **Default:** `350ms`

-How often the leader sends heartbeat messages to followers. Followers that do not receive a heartbeat within this interval begin a new election.
+The maximum time a follower will wait without receiving a heartbeat from the leader before starting a new election. The leader sends heartbeats more frequently than this value internally; this parameter is purely a follower-side timeout that triggers a new election when crossed.

 **Tuning rule:** Set to **4–5× RTT_MAX**. This ensures followers can distinguish a slow network from a dead leader without triggering spurious elections.

@@ -268,7 +268,7 @@ The number of committed log entries that must accumulate before Raft automatical
 - **Lower values** (e.g., `500`): snapshots are taken frequently, keeping the log small. A restarting node receives a recent snapshot and has fewer log entries to replay, but snapshot writes happen more often, adding brief I/O bursts.
 - **Higher values** (e.g., `5000`): less frequent snapshots mean less I/O overhead during normal operation, but a lagging node may have more log entries to replay when catching up.

-At 10 block/second, `snapshot_threshold: 5000` takes a snapshot roughly every 83 seconds.
+At 10 block/second, `snapshot_threshold: 5000` takes a snapshot roughly every 8.3 minutes (500 seconds).

 ---

@@ -286,9 +286,9 @@ The number of log entries to **retain after a snapshot** is taken. These entries

 **Effect on operations:**
 - **Lower values** (e.g., `200`): tighter disk usage; a node that misses even a few minutes of operation must receive a full snapshot on rejoin.
-- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay for up to 5 minutes at 10 block/second without needing a full snapshot transfer, reducing the cost of brief outages.
+- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay for up to 30 minutes at 10 block/second without needing a full snapshot transfer, reducing the cost of brief outages.

-Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). At 10 block/second, `trailing_logs: 18000` covers 5 minutes of absence.
+Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). At 10 block/second, `trailing_logs: 18000` covers 30 minutes of absence (1800 seconds).

 ---

@@ -346,7 +346,7 @@ raft:

 Measure RTT_MAX first and scale the timing parameters:

-```
+```text
 heartbeat_timeout = RTT_MAX × 4
 election_timeout = heartbeat_timeout × 4
 leader_lease_timeout = heartbeat_timeout / 2
@@ -361,8 +361,8 @@ Even in a Raft cluster, each node must have P2P configured. Raft handles **hot r

 ```yaml
 p2p:
-listen_address: "/ip4/0.0.0.0/tcp/7676"
-peers: "/ip4/<PEER_IP>/tcp/<PEER_TCP>/p2p/<PEER_ID>,..."
+listen_address: "/ip4/0.0.0.0/tcp/26656"
+peers: "/ip4/<PEER_IP>/tcp/26656/p2p/<PEER_ID>,..."
 ```

 Ensure P2P ports are open between nodes in addition to the Raft port.
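
The three RTT scaling equations visible in the overview.md hunk can be evaluated mechanically. A minimal Python sketch, assuming a measured RTT_MAX of 23 ms; that figure is an assumption chosen because it reproduces the 92ms / 368ms / 46ms values used in the examples, whereas cluster-setup.md only states RTT_MAX ≤ 25ms. The function name `raft_timings` is illustrative and not part of ev-node:

```python
def raft_timings(rtt_max_ms: float) -> dict[str, str]:
    """Derive Raft timing values from the worst-case round-trip time.

    Implements only the three equations shown in the hunk:
      heartbeat_timeout    = RTT_MAX * 4
      election_timeout     = heartbeat_timeout * 4
      leader_lease_timeout = heartbeat_timeout / 2
    """
    if rtt_max_ms <= 0:
        raise ValueError("rtt_max_ms must be positive")
    heartbeat = rtt_max_ms * 4
    election = heartbeat * 4
    lease = heartbeat / 2
    # ":g" drops a trailing ".0" so whole numbers print as "92ms", not "92.0ms".
    return {
        "heartbeat_timeout": f"{heartbeat:g}ms",
        "election_timeout": f"{election:g}ms",
        "leader_lease_timeout": f"{lease:g}ms",
    }


for key, value in raft_timings(23).items():
    print(f'{key}: "{value}"')
```

The ×4 heartbeat multiplier is the lower end of the 4–5× RTT_MAX tuning rule quoted in the `heartbeat_timeout` description.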

docs/guides/ha/single-to-ha.md

Lines changed: 15 additions & 8 deletions
@@ -82,10 +82,10 @@ All five nodes must sign blocks with the **same key**. The existing sequencer's
 ls ~/.evm/config/

 # Copy to each new node
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.2:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.3:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.4:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.5:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.2:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.3:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.4:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.5:~/.evm/config/
 ```

 ---
@@ -126,6 +126,13 @@ After the copy, note the **latest block height** — this is your reference poin
 cast block --rpc-url http://<EV_RETH_IP>:<EV_RETH_TCP>
 ```

+**Restart the existing sequencer now** so the chain keeps producing blocks while you prepare the remaining nodes (Steps 6–8). The chain will run uninterrupted until the planned cutover in Step 9.
+
+```bash
+# On node-1 — restart with your original single-sequencer flags
+systemctl start ev-node
+```
+
 ---

 ## Step 6: Collect Peer IDs
@@ -192,7 +199,7 @@ raft:
 node_id: "node-2" # change per node
 raft_addr: "0.0.0.0:5001"
 raft_dir: "/var/lib/ev-node/raft"
-peers: "node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
+peers: "node-1@10.0.0.1:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
 heartbeat_timeout: "92ms"
 election_timeout: "368ms"
 leader_lease_timeout: "46ms"
@@ -250,7 +257,7 @@ Use a coordination mechanism — a simple approach is to open five terminals (or
 --evnode.raft.node_id="node-1" \
 --evnode.raft.raft_addr="0.0.0.0:5001" \
 --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
---evnode.raft.peers="node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
+--evnode.raft.peers="node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
 --evnode.raft.heartbeat_timeout="92ms" \
 --evnode.raft.election_timeout="368ms" \
 --evnode.raft.leader_lease_timeout="46ms" \
@@ -281,7 +288,7 @@ Repeat for node-3, node-4, node-5.

 Within seconds of starting, one node will win the election. Look for:

-```
+```text
 INF raft: election won tally=3 leader=node-1
 INF raft: entering leader state
 INF block produced height=<N+1>
@@ -291,7 +298,7 @@ where `N` is the last block produced by the old single sequencer.

 The followers will show:

-```
+```text
 INF raft: entering follower state leader=node-1
 INF block applied from raft log height=<N+1>
 ```
