
Title: Zero advertises stale address from WAL after --my flag change, causing Alpha to connect to wrong address #9676

@shiva-istari

Describe the bug

When Zero is restarted with a different --my address (e.g., migrating from localhost to a production FQDN), Zero continues to advertise the old address to Alphas via its membership state. This is because the original address is stored in a Raft conf change entry in Zero's WAL and replayed on every restart, with no code to reconcile it against the current --my value.
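To make the missing reconciliation step concrete, here is a minimal sketch that models the replayed membership as a plain dictionary and overrides the local node's address with the current --my value. It is not Dgraph code: `reconcile_own_address` and the dictionary layout are hypothetical, and a real fix would presumably need to go through Raft rather than patch in-memory state.

```python
def reconcile_own_address(members, node_id, my_addr):
    """Return a copy of members with this node's address set to my_addr.

    members: {raft_id: addr} as rebuilt from replayed conf change entries
             (hypothetical layout, not Dgraph's real data structure).
    node_id: Raft ID of the local node.
    my_addr: the address passed via --my on this start.
    """
    updated = dict(members)  # do not mutate the replayed state
    if node_id in updated and updated[node_id] != my_addr:
        updated[node_id] = my_addr
    return updated


# Values taken from the reproduction in this issue: the WAL holds
# localhost:5080, while Zero is restarted with --my zero1:5080.
replayed = {1: "localhost:5080"}
reconciled = reconcile_own_address(replayed, node_id=1, my_addr="zero1:5080")
print(reconciled)
```

Entries for other node IDs are left untouched, since only the local node's --my value is known at startup.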

Alpha uses the --zero flag only as a bootstrap address for the initial connection. After connecting, Alpha asks Zero for the leader address via the membership state. Zero responds with the stale address from the WAL, and Alpha switches all subsequent communication to that stale address. If the old address is unreachable, the cluster is effectively broken.

To Reproduce

Steps to reproduce the behavior:
Prerequisites:

echo "127.0.0.1  zero1 alpha1" | sudo tee -a /etc/hosts
mkdir -p /tmp/fqdn-bug/{zw,w}
echo '_:a <name> "Alice" .' > /tmp/fqdn-bug/test.rdf
echo 'name: string @index(term) .' > /tmp/fqdn-bug/test.schema

1. Start Zero without --my (defaults to localhost:5080):

dgraph zero --wal /tmp/fqdn-bug/zw --raft "idx=1;" --bindall

2. Bulk load data:

dgraph bulk -f /tmp/fqdn-bug/test.rdf -s /tmp/fqdn-bug/test.schema \
  --zero localhost:5080 --out /tmp/fqdn-bug/bulk-out --reduce_shards 1

3. Start Alpha and let it register with Zero:

dgraph alpha -p /tmp/fqdn-bug/bulk-out/0/p -w /tmp/fqdn-bug/w \
  --zero localhost:5080 --bindall

Wait for the health check to pass: curl http://localhost:8080/health

4. Stop both Alpha and Zero (Ctrl-C).

5. Restart with FQDNs, keeping the old WAL directories:

# Zero with new --my
dgraph zero --wal /tmp/fqdn-bug/zw --raft "idx=1;" \
  --my zero1:5080 --bindall
# Alpha with new --zero and --my
dgraph alpha -p /tmp/fqdn-bug/bulk-out/0/p -w /tmp/fqdn-bug/w \
  --zero zero1:5080 --my alpha1:7080 --bindall

Current Behavior

Alpha logs show:

CONN: Connecting to zero1:5080                  # initial connection via --zero (correct)
CONN: Connecting to localhost:5080               # switches to stale address from membership state
Got address of a Zero leader: localhost:5080
Starting a new membership stream receive from localhost:5080
zeros:{key:1 value:{id:1 addr:"localhost:5080" leader:true}}   # stale address
Got Zero leader: localhost:5080

Zero's /state endpoint confirms: "zeros": {"1": {"addr": "localhost:5080"}} despite --my zero1:5080.
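The /state check above can be automated. The sketch below assumes the JSON shape quoted in this issue (a top-level "zeros" map whose values carry an "addr" field); `stale_zero_addrs` is a hypothetical helper, and fetching the JSON from Zero's HTTP endpoint is left out so the example stays self-contained.

```python
import json


def stale_zero_addrs(state_json, expected_addrs):
    """Return {zero_id: addr} for Zeros advertising an unexpected address.

    state_json:     body of Zero's /state response, as a string.
    expected_addrs: set of addresses passed via --my to the Zero nodes.
    """
    state = json.loads(state_json)
    zeros = state.get("zeros", {})
    return {
        zero_id: info.get("addr")
        for zero_id, info in zeros.items()
        if info.get("addr") not in expected_addrs
    }


# Fragment of the /state output reported in this issue.
sample = '{"zeros": {"1": {"addr": "localhost:5080"}}}'
stale = stale_zero_addrs(sample, {"zero1:5080"})
print(stale)
```

An empty result would mean every Zero advertises an address that matches its --my value.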

dgraph debug -w /tmp/fqdn-bug/zw shows the stale address baked into the WAL conf change entry:

1 . 1 . EntryConfChange . 41 B . Type: ConfChangeAddNode . NodeID: 0x1 . RaftContext: {Id:0x1 Addr:"localhost:5080" Group:0 Learner:false}

Expected behavior

Alpha logs should show only the new address:

CONN: Connecting to zero1:5080
Got address of a Zero leader: zero1:5080
Starting a new membership stream receive from zero1:5080
zeros:{key:1 value:{id:1 addr:"zero1:5080" leader:true}}
Got Zero leader: zero1:5080

This output can be reproduced by deleting the WAL directories (/tmp/fqdn-bug/zw and /tmp/fqdn-bug/w) and restarting fresh; data in the p directory is unaffected.
