
Commit 516b533

auricom and claude committed
docs(ha): fix raft.peers self-inclusion startup bug
The abbreviated node-2 snippet with "# peers list is identical" caused a startup failure: with raft_addr=0.0.0.0:5001 the bootstrap code's literal address comparison does not recognise node-2@10.0.0.2:5001 as self, so node-2 is appended twice and deduplicateServers returns "duplicate peers found in config".

- Fix intro text: "only raft.node_id and raft_addr differ" → "raft.node_id is unique; raft.peers and p2p.peers must exclude self"
- Expand node-2 snippet to a full evnode.yaml with the correct peers list (node-1, node-3, node-4, node-5; no node-2) and an inline explanation of the wildcard address pitfall
- Align overview.md trailing_logs example to 1 block/s (matching block_time: "1s" used throughout) and note the 10 block/s rate too

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 0b5aa74 commit 516b533

2 files changed

Lines changed: 23 additions & 6 deletions


docs/guides/ha/cluster-setup.md

Lines changed: 21 additions & 4 deletions
@@ -120,7 +120,7 @@ scp ~/.evm/config/genesis.json user@10.0.0.5:~/.evm/config/
 
 ## Step 5: Write the Configuration Files
 
-Write the following `evnode.yaml` on each node. The only field that differs per node is `raft.node_id` and `raft.raft_addr` — everything else is identical.
+Write the following `evnode.yaml` on each node. `raft.node_id` is unique per node; `raft.peers` and `p2p.peers` must each exclude the local node — everything else is identical.
 
 ### node-1 (`~/.evm/config/evnode.yaml`)
 
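The revised intro requires `raft.peers` and `p2p.peers` to exclude the local node on every host. Hand-editing five near-identical files is where self-inclusion tends to slip in, so the Go sketch below derives both peer strings for each node from one member list. It assumes the five-node layout used in this guide (node-1 to node-5 on 10.0.0.1 to 10.0.0.5, Raft on port 5001, P2P on 26656). The `member` type and the `raftPeersFor` / `p2pPeersFor` helpers are hypothetical and not part of ev-node; libp2p peer IDs stay as the guide's `<PEER_ID_NODE_N>` placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// member is a hypothetical description of one cluster node, mirroring the
// guide's five-node example (node-1..node-5 on 10.0.0.1..10.0.0.5).
type member struct {
	ID   string
	Host string
}

// raftPeersFor builds the raft.peers value for node self: every other member
// as "id@host:port", comma-separated. The local node is always skipped.
func raftPeersFor(self string, members []member, raftPort int) string {
	var parts []string
	for _, m := range members {
		if m.ID == self {
			continue // never list the local node
		}
		parts = append(parts, fmt.Sprintf("%s@%s:%d", m.ID, m.Host, raftPort))
	}
	return strings.Join(parts, ",")
}

// p2pPeersFor builds the p2p.peers value for node self. Peer IDs are left as
// <PEER_ID_NODE_N> placeholders (e.g. "node-1" -> "<PEER_ID_NODE_1>").
func p2pPeersFor(self string, members []member, p2pPort int) string {
	var parts []string
	for _, m := range members {
		if m.ID == self {
			continue // never list the local node
		}
		placeholder := "<PEER_ID_" + strings.ToUpper(strings.ReplaceAll(m.ID, "-", "_")) + ">"
		parts = append(parts, fmt.Sprintf("/ip4/%s/tcp/%d/p2p/%s", m.Host, p2pPort, placeholder))
	}
	return strings.Join(parts, ",")
}

func main() {
	members := []member{
		{"node-1", "10.0.0.1"},
		{"node-2", "10.0.0.2"},
		{"node-3", "10.0.0.3"},
		{"node-4", "10.0.0.4"},
		{"node-5", "10.0.0.5"},
	}
	for _, m := range members {
		fmt.Printf("# %s\n", m.ID)
		fmt.Printf("raft.peers: %q\n", raftPeersFor(m.ID, members, 5001))
		fmt.Printf("p2p.peers:  %q\n", p2pPeersFor(m.ID, members, 26656))
	}
}
```

The quoted values printed for node-2 match the `raft.peers` and `p2p.peers` strings in the expanded node-2 snippet. Because the skip is keyed on node ID, it does not depend on how `raft_addr` is written.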
@@ -154,20 +154,37 @@ p2p:
 
 ### node-2 (`~/.evm/config/evnode.yaml`)
 
-Each node's `raft.peers` must list every **other** node — never the node itself.
+`raft.peers` must omit the local node. Because `raft_addr` is `0.0.0.0:5001` (a wildcard), the self-exclusion check in the bootstrap code compares addresses literally — it will not recognise `node-2@10.0.0.2:5001` as itself and will add node-2 twice, causing a startup error. Always list only the **other** nodes.
 
 ```yaml
-# ... same as node-1 except node_id and raft.peers:
+node:
+  aggregator: true
+  block_time: "1s"
+
 raft:
+  enable: true
   node_id: "node-2"
   raft_addr: "0.0.0.0:5001"
+  raft_dir: "/var/lib/ev-node/raft"
   peers: "node-1@10.0.0.1:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
 
+  # Timing — tuned for RTT_MAX ≤ 25ms
+  heartbeat_timeout: "92ms"
+  election_timeout: "368ms"
+  leader_lease_timeout: "46ms"
+  send_timeout: "50ms"
+
+  # Log retention — covers ~5 hours of absence at 1 block/s
+  trailing_logs: 18000
+  snapshot_threshold: 5000
+  snap_count: 3
+
 p2p:
+  listen_address: "/ip4/0.0.0.0/tcp/26656"
   peers: "/ip4/10.0.0.1/tcp/26656/p2p/<PEER_ID_NODE_1>,/ip4/10.0.0.3/tcp/26656/p2p/<PEER_ID_NODE_3>,/ip4/10.0.0.4/tcp/26656/p2p/<PEER_ID_NODE_4>,/ip4/10.0.0.5/tcp/26656/p2p/<PEER_ID_NODE_5>"
 ```
 
-Repeat for node-3 through node-5: increment `node_id`, remove the local node from both `raft.peers` and `p2p.peers`.
+Repeat for node-3 through node-5, updating `node_id`, `raft.peers` (exclude the local node), and `p2p.peers` (exclude the local node).
 
 ---
 
docs/guides/ha/overview.md

Lines changed: 2 additions & 2 deletions
@@ -307,9 +307,9 @@ The number of log entries to **retain after a snapshot** is taken. These entries
 
 **Effect on operations:**
 - **Lower values** (e.g., `200`): tighter disk usage; a node that misses even a few minutes of operation must receive a full snapshot on rejoin.
-- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay for up to 30 minutes at 10 block/second without needing a full snapshot transfer, reducing the cost of brief outages.
+- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay without needing a full snapshot transfer, reducing the cost of brief outages. At 1 block/second (`block_time: "1s"`), `trailing_logs: 18000` covers ~5 hours; at 10 block/second, ~30 minutes.
 
-Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). At 10 block/second, `trailing_logs: 18000` covers 30 minutes of absence (1800 seconds).
+Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). Scale proportionally with your chain's block rate.
 
 ---
 