
fix: zero address change #9680

Merged
matthewmcneely merged 2 commits into main from shiva/fqdn
Apr 22, 2026

Conversation

@shiva-istari (Contributor) commented Apr 8, 2026

Description
When a Zero node is bootstrapped, its --my address is permanently recorded in the Raft WAL via a ConfChangeAddNode entry. On every subsequent restart, WAL replay restores that original address into MembershipState, overwriting the current --my flag. This causes Alphas to connect to a stale address (e.g., localhost:5080 from an initial bulk load) even when Zero is restarted with a production FQDN.
This PR introduces a leader-driven reconciliation loop that detects mismatches between the live --my address and what is stored in MembershipState, and proposes a ConfChangeUpdateNode through Raft to durably correct the address across all nodes. An in-memory override in the Connect RPC ensures Alphas receive the correct address immediately, even before the Raft entry is committed.
fixes #9676
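To make the mechanism concrete, here is a minimal, self-contained sketch of the reconciliation idea described above. It is illustrative only: the `MembershipState` struct is a stripped-down stand-in for Dgraph's real type, and `reconcileAddress` plus the `proposer` callback are hypothetical names — in the actual fix the proposal round-trips the Raft log as a ConfChangeUpdateNode rather than applying in-process.

```go
package main

import "fmt"

// MembershipState is a simplified stand-in for the state Zero rebuilds
// from the Raft WAL on restart; only the piece relevant here is kept.
type MembershipState struct {
	Zeros map[uint64]string // Zero node id -> advertised address
}

// proposer stands in for proposing a ConfChangeUpdateNode through Raft.
// In the real fix the change is committed to the Raft log so every node
// durably applies the corrected address.
type proposer func(id uint64, addr string) error

// reconcileAddress is the leader-driven check: if the live --my address
// differs from what MembershipState holds for this node, propose an
// update. Returns true when a proposal was made.
func reconcileAddress(state *MembershipState, id uint64, myAddr string, propose proposer) (bool, error) {
	stored, ok := state.Zeros[id]
	if !ok || stored == myAddr {
		return false, nil // nothing recorded, or already consistent
	}
	fmt.Printf("Zero %#x address mismatch: MembershipState has %q, expected %q. Proposing ConfChangeUpdateNode.\n",
		id, stored, myAddr)
	if err := propose(id, myAddr); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// WAL replay restored the bootstrap-time address.
	state := &MembershipState{Zeros: map[uint64]string{1: "localhost:5080"}}
	// This proposer applies immediately; the real one goes through Raft,
	// and an in-memory override covers the window before commit.
	propose := func(id uint64, addr string) error {
		state.Zeros[id] = addr
		return nil
	}
	myAddr := "dgraph-zero-0.zero-headless:5080" // hypothetical FQDN
	updated, _ := reconcileAddress(state, 1, myAddr, propose)
	fmt.Println("updated:", updated, "addr:", state.Zeros[1])
}
```

A real loop would re-run this check periodically after leader election until the corrected address is observed back in MembershipState.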

@github-actions github-actions Bot added area/testing Testing related issues go Pull requests that update Go code labels Apr 8, 2026
@matthewmcneely matthewmcneely changed the title fix zero address change fix: zero address change Apr 15, 2026
@shiva-istari shiva-istari marked this pull request as ready for review April 21, 2026 10:44
@shiva-istari shiva-istari requested a review from a team as a code owner April 21, 2026 10:44
@github-actions github-actions Bot added the area/core internal mechanisms label Apr 21, 2026

@matthewmcneely (Contributor)

Smoke Tests on k8s

Verified the failure: bulk load -> k8s deploy reproduces the stale address bug

Ran a standalone Zero locally with --my=localhost:5080, bulk loaded the 21M movie dataset (3 shards), then injected the Zero WAL and Alpha p directories into a fresh Docker Desktop k8s cluster (1 Zero, 3 Alphas, sharded config, dgraph/dgraph:latest v25.3.3). When the k8s Zero started, WAL replay overwrote its live --my address with localhost:5080 — confirmed via /state. All three Alphas got stuck in a connection-refused loop trying to dial localhost:5080 inside their containers and never became Ready.
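The /state confirmation above can be scripted. This is a hedged sketch of a small decoder for the advertised Zero addresses: the `zeros` map with per-node `addr` fields mirrors Zero's /state JSON, but treat the exact payload shape as an assumption, and the sample payload here is hard-coded rather than fetched from a live cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// zeroMember captures only the /state fields needed for this check.
type zeroMember struct {
	Addr   string `json:"addr"`
	Leader bool   `json:"leader"`
}

// stateResponse models the top-level "zeros" map of the /state payload.
type stateResponse struct {
	Zeros map[string]zeroMember `json:"zeros"`
}

// advertisedZeroAddrs extracts the addresses Zero is advertising to Alphas.
func advertisedZeroAddrs(payload []byte) (map[string]string, error) {
	var s stateResponse
	if err := json.Unmarshal(payload, &s); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(s.Zeros))
	for id, m := range s.Zeros {
		out[id] = m.Addr
	}
	return out, nil
}

func main() {
	// Sample payload shaped like the repro: the WAL-replayed stale address.
	sample := []byte(`{"zeros":{"1":{"addr":"localhost:5080","leader":true}}}`)
	addrs, err := advertisedZeroAddrs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs["1"]) // the stale address Alphas would dial
}
```

Against a live cluster the payload would come from a GET of Zero's HTTP /state endpoint instead of the hard-coded sample.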

Verified fix works: bulk load -> k8s deploy with dgraph/dgraph:local (built on shiva/fqdn)

Rebuilt dgraph/dgraph:local from the PR branch (shiva/fqdn), redeployed the k8s cluster, and injected the same stale Zero WAL (localhost:5080) and bulk-loaded Alpha data from the bug repro. On startup, Zero's reconciliation loop detected the mismatch within seconds of leader election and corrected it via Raft:

Zero 0x1 address mismatch: MembershipState has "localhost:5080",
  expected "dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080".
  Proposing ConfChangeUpdateNode.
Applied ConfChangeUpdateNode for Zero 0x1:
  addr="dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080"

All three Alphas came up Ready, connected to Zero at the corrected address, and served queries against the 21M movie dataset across all 3 shard groups. This is the same stale WAL that caused a total cluster failure on main — the fix resolves it without manual intervention.

@matthewmcneely matthewmcneely merged commit 2da01c5 into main Apr 22, 2026
30 of 32 checks passed
@matthewmcneely matthewmcneely deleted the shiva/fqdn branch April 22, 2026 23:46

Labels

area/core internal mechanisms
area/testing Testing related issues
go Pull requests that update Go code

Development

Successfully merging this pull request may close these issues.

Zero advertises stale address from WAL after --my flag change, causing Alpha to connect to wrong address (#9676)

2 participants