fix: zero address change by shiva-istari · Pull Request #9680 · dgraph-io/dgraph

shiva-istari · 2026-04-08T16:52:59Z

Description
When a Zero node is bootstrapped, its --my address is permanently recorded in the Raft WAL via a ConfChangeAddNode entry. On every subsequent restart, WAL replay restores that original address into MembershipState, overwriting the current --my flag. This causes Alphas to connect to a stale address (e.g., localhost:5080 from an initial bulk load) even when Zero is restarted with a production FQDN.
This PR introduces a leader-driven reconciliation loop that detects mismatches between the live --my address and what is stored in MembershipState, and proposes a ConfChangeUpdateNode through Raft to durably correct the address across all nodes. An in-memory override in the Connect RPC ensures Alphas receive the correct address immediately, even before the Raft entry is committed.
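The detect-and-override behaviour described above can be illustrated with a minimal Go sketch. It is not the PR's code: `zeroState`, `staleAddr`, `connectAddr`, and the example FQDN are hypothetical names chosen for illustration, the Raft proposal is reduced to a plain map write, and the sketch assumes the leader reconciles only its own entry.

```go
package main

import "fmt"

// zeroState is a hypothetical, simplified stand-in for the Zero entries in
// MembershipState: Raft ID -> advertised address.
type zeroState map[uint64]string

// staleAddr returns the stored address when it disagrees with the live --my
// value. Only the leader reconciles, and only its own entry (an assumption).
func staleAddr(s zeroState, selfID uint64, liveAddr string, isLeader bool) (string, bool) {
	if !isLeader {
		return "", false
	}
	stored, found := s[selfID]
	if !found || stored == liveAddr {
		return "", false
	}
	return stored, true
}

// connectAddr models the in-memory override: the local node always answers
// with its live address, whether or not the Raft update has been applied yet.
func connectAddr(s zeroState, id, selfID uint64, liveAddr string) string {
	if id == selfID {
		return liveAddr
	}
	return s[id]
}

func main() {
	const selfID = uint64(1)
	live := "zero-0.example.internal:5080" // hypothetical FQDN

	// WAL replay restores the address recorded at bootstrap.
	s := zeroState{selfID: "localhost:5080"}

	// Before any update commits, Connect already reports the live address.
	fmt.Println(connectAddr(s, selfID, selfID, live))

	if old, ok := staleAddr(s, selfID, live, true); ok {
		fmt.Printf("mismatch: stored %q, expected %q\n", old, live)
		s[selfID] = live // stands in for applying the committed update
	}

	// After the update is applied, the check finds nothing to correct.
	_, stillStale := staleAddr(s, selfID, live, true)
	fmt.Println(stillStale)
}
```

In the real change the correction goes through Raft as a ConfChangeUpdateNode, so it is durable and visible to every node; the map write here only marks where that step would take effect.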
fixes #9676

matthewmcneely · 2026-04-22T21:26:53Z

Smoke Tests on k8s

Verify the failure: bulk load -> k8s deploy reproduces the stale address bug

Ran a standalone Zero locally with --my=localhost:5080, bulk loaded the 21M movie dataset (3 shards), then injected the Zero WAL and Alpha p directories into a fresh Docker Desktop k8s cluster (1 Zero, 3 Alphas, sharded config, dgraph/dgraph:latest v25.3.3). When the k8s Zero started, WAL replay overwrote its live --my address with localhost:5080 — confirmed via /state. All three Alphas got stuck in a connection-refused loop trying to dial localhost:5080 inside their containers and never became Ready.
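A check like the /state confirmation above could be scripted along these lines. This is a sketch under assumptions: the partial JSON shape (a `zeros` object keyed by ID, each entry carrying `id` and `addr`) is assumed rather than taken from this PR, and `mismatchedZeros` and the example FQDN are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// zeroInfo holds the fields of interest for one Zero member. The JSON field
// names are an assumption about the /state payload.
type zeroInfo struct {
	ID   string `json:"id"`
	Addr string `json:"addr"`
}

// stateDoc is an assumed, partial shape of the /state response.
type stateDoc struct {
	Zeros map[string]zeroInfo `json:"zeros"`
}

// mismatchedZeros returns, in sorted order, the IDs whose reported address
// differs from the expected address supplied in want (ID -> address).
func mismatchedZeros(payload []byte, want map[string]string) ([]string, error) {
	var doc stateDoc
	if err := json.Unmarshal(payload, &doc); err != nil {
		return nil, err
	}
	var ids []string
	for id, expected := range want {
		if doc.Zeros[id].Addr != expected {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids, nil
}

func main() {
	// Payload modelled on the failure case: Zero 1 still reports localhost.
	payload := []byte(`{"zeros":{"1":{"id":"1","addr":"localhost:5080"}}}`)
	want := map[string]string{"1": "zero-0.example.internal:5080"}

	ids, err := mismatchedZeros(payload, want)
	if err != nil {
		panic(err)
	}
	fmt.Println("stale Zero IDs:", ids)
}
```

An ID listed in `want` but absent from the payload is also reported, since its address compares as empty.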

Verified fix works: bulk load -> k8s deploy with dgraph/dgraph:local (built on shiva/fqdn)

Rebuilt dgraph/dgraph:local from the PR branch (shiva/fqdn), redeployed the k8s cluster, and injected the same stale Zero WAL (localhost:5080) and bulk-loaded Alpha data from the bug repro. On startup, Zero's reconciliation loop detected the mismatch within seconds of leader election and corrected it via Raft:

Zero 0x1 address mismatch: MembershipState has "localhost:5080",
  expected "dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080".
  Proposing ConfChangeUpdateNode.
Applied ConfChangeUpdateNode for Zero 0x1:
  addr="dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080"

All three Alphas came up Ready, connected to Zero at the corrected address, and served queries against the 21M movie dataset across all 3 shard groups. This is the same stale WAL that caused a total cluster failure on main — the fix resolves it without manual intervention.

github-actions Bot added the area/testing and go labels Apr 8, 2026

shiva-istari requested a review from matthewmcneely April 8, 2026 18:30

matthewmcneely changed the title ~~fix zero address change~~ fix: zero address change Apr 15, 2026

matthewmcneely force-pushed the shiva/fqdn branch from b03473a to 1ceee04 Compare April 15, 2026 21:17

amalistari mentioned this pull request Apr 16, 2026

Enhance lsbackup with Date Filtering and Lightweight Listing #9684

Open

shiva-istari force-pushed the shiva/fqdn branch from 1ceee04 to b1dd5a6 Compare April 21, 2026 10:44

shiva-istari marked this pull request as ready for review April 21, 2026 10:44

shiva-istari requested a review from a team as a code owner April 21, 2026 10:44

github-actions Bot added the area/core label Apr 21, 2026

shiva-istari force-pushed the shiva/fqdn branch from b1dd5a6 to da212d3 Compare April 21, 2026 13:42

Shivaji Kharse added 2 commits April 22, 2026 18:18

fix zero address change (07457d1)
add tests and fix issues related logs (dbf0974)

shiva-istari force-pushed the shiva/fqdn branch from da212d3 to dbf0974 Compare April 22, 2026 12:49

shiva-istari self-assigned this Apr 22, 2026

shiva-istari requested a review from mlwelles April 22, 2026 12:52

matthewmcneely approved these changes Apr 22, 2026

matthewmcneely merged commit 2da01c5 into main Apr 22, 2026
30 of 32 checks passed

matthewmcneely deleted the shiva/fqdn branch April 22, 2026 23:46

matthewmcneely merged 2 commits into main from shiva/fqdn
