Skip to content

Commit 7d7ce38

Browse files
Pin the find_leader skip-and-continue path under a bad handshake
The find_leader loop at cluster.py:109-132 iterates candidate nodes and catches transport/protocol errors per node so that one bad peer does not wedge discovery. The wrapping chain that makes this work — wire DecodeError/ProtocolError from a malformed handshake response gets re-raised as client ProtocolError by _read_response and lands in the except tuple — was only covered transitively through the single-node `test_query_leader_closes_writer_on_handshake_error`. Add a two-node test that replies to the first node with 64 bytes of zeros (decodes as a FAILURE-typed frame whose body is too short for the uint64 code, raising wire DecodeError) and to the second with a valid Welcome + Leader pair, and assert that find_leader skips the bad node and returns a valid leader address. The test is deterministic under the cluster's intentional node shuffle because the mock_reader's side_effect sequence is consumed in call order — whichever node dials first consumes the bad handshake bytes, the other consumes the valid welcome. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 877c0a8 commit 7d7ce38

File tree

1 file changed

+42
-0
lines changed

1 file changed

+42
-0
lines changed

tests/test_cluster.py

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -187,6 +187,48 @@ async def test_query_leader_closes_writer_on_protocol_init_error(self) -> None:
187187
# Even though DqliteProtocol() failed, the writer must be closed
188188
mock_writer.close.assert_called()
189189

190+
async def test_find_leader_skips_node_with_bad_handshake(self) -> None:
191+
"""When a node's handshake round-trip fails (malformed response
192+
triggers a wire-level ProtocolError), ``find_leader`` must move
193+
on to the next node and succeed. Guards the skip-and-continue
194+
loop at cluster.py:109-132 against regressions — a bug in the
195+
except tuple would turn one bad peer into a cluster-wide failure.
196+
"""
197+
store = MemoryNodeStore(["localhost:9001", "localhost:9002"])
198+
client = ClusterClient(store, timeout=1.0)
199+
200+
mock_reader = AsyncMock()
201+
mock_writer = MagicMock()
202+
mock_writer.drain = AsyncMock()
203+
mock_writer.close = MagicMock()
204+
mock_writer.wait_closed = AsyncMock()
205+
206+
from dqlitewire.messages import LeaderResponse, WelcomeResponse
207+
208+
# Node A responds with 64 bytes of zeros: that decodes as a
209+
# FAILURE-typed frame with a zero-byte body, and FailureResponse.
210+
# decode_body raises DecodeError because the body is too short
211+
# for the uint64 code. The wire ProtocolError is wrapped into
212+
# the client ProtocolError by DqliteProtocol._read_response and
213+
# caught by find_leader, which moves on to Node B.
214+
#
215+
# Node B responds with a valid Welcome + Leader pair, so
216+
# find_leader returns Node B's address.
217+
responses = [
218+
b"\x00" * 64,
219+
WelcomeResponse(heartbeat_timeout=15000).encode(),
220+
LeaderResponse(node_id=2, address="").encode(),
221+
]
222+
mock_reader.read.side_effect = responses
223+
224+
with patch("asyncio.open_connection", return_value=(mock_reader, mock_writer)):
225+
leader = await client.find_leader()
226+
227+
assert leader in {"localhost:9001", "localhost:9002"}
228+
# Two open_connection attempts expected (one per node); the
229+
# failing node's writer was closed before the skip.
230+
assert mock_writer.close.call_count >= 2
231+
190232
async def test_query_leader_does_not_hang_on_slow_wait_closed(self) -> None:
191233
"""_query_leader must not hang if wait_closed() blocks (e.g., unresponsive peer)."""
192234
import asyncio

0 commit comments

Comments
 (0)