|
| 1 | +# Integration Test Findings |
| 2 | + |
| 3 | +## Finding #1: Infinite Loop in Filtered Purge & Resubmit (Critical) |
| 4 | + |
| 5 | +**Severity:** Critical — hangs indefinitely in production, consuming resources and never completing. |
| 6 | + |
| 7 | +**Discovered by:** Integration test `PurgeDlqIntegrationShould.RemoveOnlyMatchingMessages_WhenCategoryAndTimeFiltersProvided` |
| 8 | + |
| 9 | +**Affected handlers:** |
| 10 | +- `PurgeDlqMessagesCommandHandler.PurgeWithFilterAsync` |
| 11 | +- `ResubmitDlqMessagesCommandHandler.ResubmitWithFilterAsync` |
| 12 | + |
| 13 | +### Description |
| 14 | + |
| 15 | +Both filtered handlers use a `while` loop that continues receiving batches from the DLQ until `emptyBatches` reaches a threshold of 3. The original code unconditionally reset `emptyBatches = 0` whenever a non-empty batch was received: |
| 16 | + |
| 17 | +```csharp |
| 18 | +if (messages.Count == 0) |
| 19 | +{ |
| 20 | + emptyBatches++; |
| 21 | + continue; |
| 22 | +} |
| 23 | + |
| 24 | +emptyBatches = 0; // BUG: always resets, even if no messages matched the filter |
| 25 | +``` |
| 26 | + |
| 27 | +When a filter is applied and some messages don't match, those messages are **abandoned** back to the DLQ. On the next iteration, the receiver picks them up again. Because the batch is never empty (the same non-matching messages keep returning), `emptyBatches` never reaches the threshold and the loop runs forever. |
| 28 | + |
| 29 | +### Root cause |
| 30 | + |
| 31 | +The `emptyBatches` counter was tracking "did we receive any messages?" when it should have been tracking "did we make any progress?" A batch where every message is abandoned represents zero progress and should be treated the same as an empty batch. |
| 32 | + |
| 33 | +### Fix |
| 34 | + |
| 35 | +Two changes were required: |
| 36 | + |
| 37 | +**1. Termination:** Reset `emptyBatches` only when at least one message was actually processed (completed/resubmitted). When no messages matched the filter, increment `emptyBatches` instead: |
| 38 | + |
| 39 | +```csharp |
| 40 | +if (toComplete.Count > 0) |
| 41 | +{ |
| 42 | + emptyBatches = 0; |
| 43 | +} |
| 44 | +else |
| 45 | +{ |
| 46 | + emptyBatches++; |
| 47 | +} |
| 48 | +``` |
| 49 | + |
| 50 | +**2. Accurate skip count:** Replace the `totalSkipped` counter with a `HashSet<long>` tracking sequence numbers. Since abandoned messages are re-received across multiple iterations, a simple counter would count the same message multiple times (e.g., 2 non-matching messages × 4 iterations = 8 reported skipped, when only 2 were actually skipped): |
| 51 | + |
| 52 | +```csharp |
| 53 | +var skippedSequenceNumbers = new HashSet<long>(); |
| 54 | +// ... |
| 55 | +foreach (var m in toAbandon) |
| 56 | +{ |
| 57 | + skippedSequenceNumbers.Add(m.SequenceNumber); |
| 58 | +} |
| 59 | +// Result uses skippedSequenceNumbers.Count |
| 60 | +``` |
| 61 | + |
| 62 | +### Why this matters |
| 63 | + |
| 64 | +This bug would cause the CLI to hang indefinitely whenever a user runs `purge-dlq` or `resubmit-dlq` with a category or time filter on a DLQ that contains messages not matching the filter. The only escape would be forcefully terminating the process. |
| 65 | + |
| 66 | +The bug was invisible to unit tests because they mock the `ServiceBusReceiver` and control what messages are returned. In unit tests, the receiver returns a pre-determined sequence and eventually returns an empty batch, so the loop terminates. Only a real Service Bus receiver exhibits the behavior where abandoned messages reappear in subsequent receive calls. |
| 67 | + |
| 68 | +### Value of integration testing |
| 69 | + |
| 70 | +This finding demonstrates precisely why integration tests against a real (emulated) Service Bus are essential: |
| 71 | + |
| 72 | +1. **Real message lifecycle** — Unit tests cannot replicate the fact that abandoned DLQ messages return to the queue and are re-received. |
| 73 | +2. **Behavioral fidelity** — The emulator faithfully reproduces the abandon-and-reappear behavior of Azure Service Bus, exposing the infinite loop. |
| 74 | +3. **Timeout as a signal** — The test hung instead of completing, making the bug immediately obvious. A unit test with mocked receivers would have passed silently. |
| 75 | + |
| 76 | +This single finding alone justifies the investment in the integration test infrastructure. |
0 commit comments