docs(ha): address PR review feedback

auricom · claude · auricom · commit 409e21809192 · 2026-04-28T14:15:18.000+02:00

Critical fixes:
- Fix snapshot_threshold math: 5000 ÷ 10 = 500s ≈ 8.3 min (not 83s)
- Fix trailing_logs math: 18000 ÷ 10 = 1800s = 30 min (not 5 min)
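
The corrected figures follow from dividing the entry count by the block rate. A minimal Python sketch of that check, assuming one committed Raft log entry per block and the 10 blocks/second rate used in overview.md (coverage_seconds is an illustrative helper, not part of ev-node):

```python
def coverage_seconds(entries: int, blocks_per_second: float = 10.0) -> float:
    """Seconds of chain activity covered by `entries` Raft log entries.

    Assumption of this sketch: one committed log entry per block.
    """
    if blocks_per_second <= 0:
        raise ValueError("blocks_per_second must be positive")
    return entries / blocks_per_second


snapshot_interval = coverage_seconds(5000)   # snapshot_threshold example
replay_window = coverage_seconds(18000)      # trailing_logs example

print(f"snapshot_threshold=5000 -> {snapshot_interval:.0f}s ({snapshot_interval / 60:.1f} min)")
print(f"trailing_logs=18000     -> {replay_window:.0f}s ({replay_window / 60:.1f} min)")
```

If the real block rate differs, pass it as blocks_per_second; the durations scale inversely with it.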

Medium fixes:
- Fix heartbeat_timeout description: it is a follower-side election trigger,
  not the interval at which the leader sends heartbeats
- Add explicit restart instruction after Step 5 data copy in single-to-ha.md
  so the chain keeps producing blocks during preparation (Steps 6-8)
- Replace priv_validator_key.json with signer.json in single-to-ha.md
  to match cluster-setup.md and the E2E tests

Minor fixes:
- Exclude self from raft.peers in all examples (cluster-setup.md node-1
  yaml/CLI/systemd, single-to-ha.md node-1 and node-2)
- Add "exclude local node" note to raft.peers description in overview.md
- Fix P2P port in overview.md Interaction with P2P section (7676 → 26656)
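- Add text language tag to all bare fenced blocks (MD040): multiaddr
  example, RTT equations, and all log snippets
- Prefer systemctl stop over kill $(pgrep evm) in the cluster-setup.md
  leader-failure test; keep a direct kill as a fallback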

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/docs/guides/ha/cluster-setup.md b/docs/guides/ha/cluster-setup.md
@@ -26,7 +26,7 @@ Replace these with your actual IP addresses throughout the guide.
 
 P2P peers use the libp2p multiaddr format, which includes each node's peer ID:
 
-```
+```text
 /ip4/<ip>/tcp/<port>/p2p/<peer-id>
 ```
 
@@ -132,7 +132,7 @@ raft:
   node_id: "node-1"
   raft_addr: "0.0.0.0:5001"
   raft_dir: "/var/lib/ev-node/raft"
-  peers: "node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
+  peers: "node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
 
   # Timing — tuned for RTT_MAX ≤ 25ms
   heartbeat_timeout:    "92ms"
@@ -194,7 +194,7 @@ Start all five nodes as close together as possible. The order does not matter bu
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.peers="node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
+  --evnode.raft.peers="node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
   --evnode.raft.heartbeat_timeout="92ms" \
   --evnode.raft.election_timeout="368ms" \
   --evnode.raft.leader_lease_timeout="46ms" \
@@ -218,7 +218,7 @@ Adjust flags for your execution layer (e.g., remove EVM flags if you are running
 
 Within a few seconds of starting, you should see one node win the election:
 
-```
+```text
 INF raft: entering candidate state  node=node-1
 INF raft: election won               tally=3
 INF raft: entering leader state      leader=node-1
@@ -227,7 +227,7 @@ INF block produced                   height=1 hash=0xabc...
 
 The other nodes will log:
 
-```
+```text
 INF raft: entering follower state  leader=node-1
 INF block applied from raft log    height=1 hash=0xabc...
 ```
@@ -260,8 +260,14 @@ done
 With all five nodes running and producing blocks, simulate a leader failure:
 
 ```bash
-# Identify the current leader from its logs, then on that machine:
-kill -SIGTERM $(pgrep evm)
+# Identify the current leader from its logs, then on that machine.
+# Preferred: use the systemd unit if ev-node runs as a service
+sudo systemctl stop ev-node
+
+# Fallback: stop the process directly (verify exactly one PID before killing)
+PID=$(pgrep -f "evm start")
+echo "Stopping PID $PID"
+kill -SIGTERM "$PID"
 ```
 
 Within `election_timeout` (368ms in this configuration), the remaining four nodes will elect a new leader and resume block production. Measure the actual gap in your logs:
@@ -294,7 +300,7 @@ ExecStart=/usr/local/bin/evm start \
   --evnode.raft.node_id=node-1 \
   --evnode.raft.raft_addr=0.0.0.0:5001 \
   --evnode.raft.raft_dir=/var/lib/ev-node/raft \
-  --evnode.raft.peers=node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001 \
+  --evnode.raft.peers=node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001 \
   --evnode.raft.heartbeat_timeout=92ms \
   --evnode.raft.election_timeout=368ms \
   --evnode.raft.leader_lease_timeout=46ms \
diff --git a/docs/guides/ha/overview.md b/docs/guides/ha/overview.md
@@ -151,7 +151,7 @@ raft:
 **CLI:** `--evnode.raft.peers`  
 **Default:** _(none, required)_
 
-A comma-separated list of **all** remote cluster members, in the format `nodeID@host:port`. The host and port must be the Raft address (`raft_addr`) of each peer as reachable from this node.
+A comma-separated list of the **other** cluster members (exclude the local node), in the format `nodeID@host:port`. The host and port must be the Raft address (`raft_addr`) of each peer as reachable from this node. Do not list the node's own `node_id` in its own `peers` field.
 
  Raft uses this list to:
 - Bootstrap the cluster on first start (when no persisted state exists).
@@ -182,7 +182,7 @@ raft:
 **CLI:** `--evnode.raft.heartbeat_timeout`  
 **Default:** `350ms`
 
-How often the leader sends heartbeat messages to followers. Followers that do not receive a heartbeat within this interval begin a new election.
+The maximum time a follower waits without receiving a heartbeat from the leader before it starts a new election. The leader sends heartbeats more frequently than this value; the parameter is a follower-side timeout, not the leader's send interval.
 
 **Tuning rule:** Set to **4–5× RTT_MAX**. This ensures followers can distinguish a slow network from a dead leader without triggering spurious elections.
 
@@ -268,7 +268,7 @@ The number of committed log entries that must accumulate before Raft automatical
 - **Lower values** (e.g., `500`): snapshots are taken frequently, keeping the log small. A restarting node receives a recent snapshot and has fewer log entries to replay, but snapshot writes happen more often, adding brief I/O bursts.
 - **Higher values** (e.g., `5000`): less frequent snapshots mean less I/O overhead during normal operation, but a lagging node may have more log entries to replay when catching up.
 
-At 10 block/second, `snapshot_threshold: 5000` takes a snapshot roughly every 83 seconds.
+At 10 blocks/second, `snapshot_threshold: 5000` takes a snapshot roughly every 8.3 minutes (500 seconds).
 
 ---
 
@@ -286,9 +286,9 @@ The number of log entries to **retain after a snapshot** is taken. These entries
 
 **Effect on operations:**
 - **Lower values** (e.g., `200`): tighter disk usage; a node that misses even a few minutes of operation must receive a full snapshot on rejoin.
-- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay for up to 5 minutes at 10 block/second without needing a full snapshot transfer, reducing the cost of brief outages.
+- **Higher values** (e.g., `18000`): a lagging node can catch up via log replay for up to 30 minutes at 10 blocks/second without needing a full snapshot transfer, reducing the cost of brief outages.
 
-Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). At 10 block/second, `trailing_logs: 18000` covers 5 minutes of absence.
+Set this high enough to cover your typical maintenance window (restart, upgrade, brief network partition). At 10 blocks/second, `trailing_logs: 18000` covers 30 minutes of absence (1800 seconds).
 
 ---
 
@@ -346,7 +346,7 @@ raft:
 
 Measure RTT_MAX first and scale the timing parameters:
 
-```
+```text
 heartbeat_timeout    = RTT_MAX × 4
 election_timeout     = heartbeat_timeout × 4
 leader_lease_timeout = heartbeat_timeout / 2
@@ -361,8 +361,8 @@ Even in a Raft cluster, each node must have P2P configured. Raft handles **hot r
 
 ```yaml
 p2p:
-  listen_address: "/ip4/0.0.0.0/tcp/7676"
-  peers: "/ip4/<PEER_IP>/tcp/<PEER_TCP>/p2p/<PEER_ID>,..."
+  listen_address: "/ip4/0.0.0.0/tcp/26656"
+  peers: "/ip4/<PEER_IP>/tcp/26656/p2p/<PEER_ID>,..."
 ```
 
 Ensure P2P ports are open between nodes in addition to the Raft port.
diff --git a/docs/guides/ha/single-to-ha.md b/docs/guides/ha/single-to-ha.md
@@ -82,10 +82,10 @@ All five nodes must sign blocks with the **same key**. The existing sequencer's
 ls ~/.evm/config/
 
 # Copy to each new node
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.2:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.3:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.4:~/.evm/config/
-scp ~/.evm/config/priv_validator_key.json user@10.0.0.5:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.2:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.3:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.4:~/.evm/config/
+scp ~/.evm/config/signer.json user@10.0.0.5:~/.evm/config/
 ```
 
 ---
@@ -126,6 +126,13 @@ After the copy, note the **latest block height** — this is your reference poin
 cast block --rpc-url http://<EV_RETH_IP>:<EV_RETH_TCP>
 ```
 
+**Restart the existing sequencer now** so the chain keeps producing blocks while you prepare the remaining nodes (Steps 6–8). The chain will run uninterrupted until the planned cutover in Step 9.
+
+```bash
+# On node-1 — restart with your original single-sequencer flags
+sudo systemctl start ev-node
+```
+
 ---
 
 ## Step 6: Collect Peer IDs
@@ -192,7 +199,7 @@ raft:
   node_id: "node-2"      # change per node
   raft_addr: "0.0.0.0:5001"
   raft_dir: "/var/lib/ev-node/raft"
-  peers: "node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
+  peers: "node-1@10.0.0.1:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001"
   heartbeat_timeout:    "92ms"
   election_timeout:     "368ms"
   leader_lease_timeout: "46ms"
@@ -250,7 +257,7 @@ Use a coordination mechanism — a simple approach is to open five terminals (or
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.peers="node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
+  --evnode.raft.peers="node-2@10.0.0.2:5001,node-3@10.0.0.3:5001,node-4@10.0.0.4:5001,node-5@10.0.0.5:5001" \
   --evnode.raft.heartbeat_timeout="92ms" \
   --evnode.raft.election_timeout="368ms" \
   --evnode.raft.leader_lease_timeout="46ms" \
@@ -281,7 +288,7 @@ Repeat for node-3, node-4, node-5.
 
 Within seconds of starting, one node will win the election. Look for:
 
-```
+```text
 INF raft: election won  tally=3  leader=node-1
 INF raft: entering leader state
 INF block produced  height=<N+1>
@@ -291,7 +298,7 @@ where `N` is the last block produced by the old single sequencer.
 
 The followers will show:
 
-```
+```text
 INF raft: entering follower state  leader=node-1
 INF block applied from raft log  height=<N+1>
 ```