Describe the bug
When Zero is restarted with a different --my address (e.g., migrating from localhost to a production FQDN), Zero continues to advertise the old address to Alphas via its membership state. This is because the original address is stored in a Raft conf change entry in Zero's WAL and replayed on every restart, with no code to reconcile it against the current --my value.
Alpha uses the --zero flag only as a bootstrap address for the initial connection. After connecting, Alpha asks Zero for the leader address via the membership state. Zero responds with the stale address from the WAL, and Alpha switches all subsequent communication to that stale address. If the old address is unreachable, the cluster is effectively broken.
To Reproduce
Steps to reproduce the behavior:
Prerequisites:
echo "127.0.0.1 zero1 alpha1" | sudo tee -a /etc/hosts
mkdir -p /tmp/fqdn-bug/{zw,w}
echo '_:a <name> "Alice" .' > /tmp/fqdn-bug/test.rdf
echo 'name: string @index(term) .' > /tmp/fqdn-bug/test.schema
1. Start Zero without --my (defaults to localhost:5080):
dgraph zero --wal /tmp/fqdn-bug/zw --raft "idx=1;" --bindall
2. Bulk load data:
dgraph bulk -f /tmp/fqdn-bug/test.rdf -s /tmp/fqdn-bug/test.schema \
--zero localhost:5080 --out /tmp/fqdn-bug/bulk-out --reduce_shards 1
3. Start Alpha and let it register with Zero:
dgraph alpha -p /tmp/fqdn-bug/bulk-out/0/p -w /tmp/fqdn-bug/w \
--zero localhost:5080 --bindall
Wait for health check: curl http://localhost:8080/health
4. Stop both Alpha and Zero (Ctrl-C).
5. Restart with FQDNs, keeping old WAL directories:
# Zero with new --my
dgraph zero --wal /tmp/fqdn-bug/zw --raft "idx=1;" \
--my zero1:5080 --bindall
# Alpha with new --zero and --my
dgraph alpha -p /tmp/fqdn-bug/bulk-out/0/p -w /tmp/fqdn-bug/w \
--zero zero1:5080 --my alpha1:7080 --bindall
Current Behavior
Alpha logs show:
CONN: Connecting to zero1:5080 # initial connection via --zero (correct)
CONN: Connecting to localhost:5080 # switches to stale address from membership state
Got address of a Zero leader: localhost:5080
Starting a new membership stream receive from localhost:5080
zeros:{key:1 value:{id:1 addr:"localhost:5080" leader:true}} # stale address
Got Zero leader: localhost:5080
Zero's /state endpoint confirms: "zeros": {"1": {"addr": "localhost:5080"}} despite --my zero1:5080.
dgraph debug -w /tmp/fqdn-bug/zw shows the stale address baked into the WAL conf change entry:
1 . 1 . EntryConfChange . 41 B . Type: ConfChangeAddNode . NodeID: 0x1 . RaftContext: {Id:0x1 Addr:"localhost:5080" Group:0 Learner:false}
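To list every address recorded this way, the debug output can be filtered with a small helper. This is a minimal sketch that assumes the entry format matches the line shown above; wal_conf_addrs is a hypothetical name, not part of dgraph.

```shell
# Hypothetical helper: reads `dgraph debug -w <dir>` output on stdin and
# prints "NodeID Addr" for every ConfChangeAddNode entry.
# Assumes the entry layout shown in this report (NodeID: ... Addr:"...").
wal_conf_addrs() {
  grep 'ConfChangeAddNode' |
    sed -n 's/.*NodeID: \([^ ]*\) .*Addr:"\([^"]*\)".*/\1 \2/p'
}
```

Usage: dgraph debug -w /tmp/fqdn-bug/zw | wal_conf_addrs. Any address printed here that differs from the current --my value is a candidate for the stale entry described in this report.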
Expected behavior
Alpha logs should show only the new address:
CONN: Connecting to zero1:5080
Got address of a Zero leader: zero1:5080
Starting a new membership stream receive from zero1:5080
zeros:{key:1 value:{id:1 addr:"zero1:5080" leader:true}}
Got Zero leader: zero1:5080
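This can be verified by deleting the WAL directories (/tmp/fqdn-bug/zw and /tmp/fqdn-bug/w) and restarting both processes with the FQDN flags; Alpha then logs only the new address. Data in the p directory is unaffected.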
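To tell the current and expected log outputs apart mechanically, an Alpha log can be checked for the leader addresses it reports. This is a rough sketch that assumes the "Got Zero leader:" message format quoted in this report; both function names are hypothetical.

```shell
# Hypothetical helper: prints each distinct Zero leader address announced
# in an Alpha log read from stdin. Matches only "Got Zero leader: <addr>".
zero_leaders_in_log() {
  sed -n 's/.*Got Zero leader: \([^ ]*\).*/\1/p' | sort -u
}

# Hypothetical helper: succeeds only if the log on stdin announces exactly
# one leader address and it equals the expected one passed as $1.
only_expected_leader() {
  expected="$1"
  leaders="$(zero_leaders_in_log)"
  [ "$leaders" = "$expected" ]
}
```

For example, piping a saved Alpha log into only_expected_leader zero1:5080 fails for the log under Current Behavior (it reports localhost:5080) and succeeds for the log under Expected behavior.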