# Azure Event Hubs: Performance Monitoring - Overview

Costa Rica

[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com)
[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/)
[brown9804](https://github.com/brown9804)

Last updated: 2025-07-17

----------

<details>
<summary><b>List of References</b> (Click to expand)</summary>

- [Azure Event Hubs quotas and limits](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas#basic-vs-standard-vs-premium-vs-dedicated-tiers)

</details>
## Tiers

> Here is the list of quotas and limits for each Event Hubs tier: [Basic vs. standard vs. premium vs. dedicated tiers](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas#basic-vs-standard-vs-premium-vs-dedicated-tiers)

| **Tier** | **Ingress (Send)** | **Egress (Receive)** | **Scalability & Notes** |
|----------|--------------------|----------------------|-------------------------|
| **Basic** | ~1 MB/s or 1,000 events/sec per TU | ~2 MB/s or 4,096 events/sec per TU | Max 20 TUs per namespace. No Capture feature. Ideal for small workloads. |
| **Standard** | Same as Basic | Same as Basic | Includes Capture. Max 40 TUs per namespace. Supports scaling via TUs. |
| **Premium** | ~5–10 MB/s ingress per PU | ~10–20 MB/s egress per PU | Uses Processing Units (PUs). Dedicated resources. Scales by adding PUs. |
| **Dedicated** | Fully customizable (via Capacity Units) | Fully customizable (via Capacity Units) | Enterprise-grade. Scales via Capacity Units (CUs). Ideal for massive, mission-critical workloads. |
> [!NOTE]
> - **TU = Throughput Unit** (Basic/Standard): Each `TU gives you ~1 MB/s ingress and ~2 MB/s egress.`
> - **PU = Processing Unit** (Premium): Each PU offers `~5–10 MB/s ingress and ~10–20 MB/s egress depending on partition count and consumer efficiency.`
> - **CU = Capacity Unit** (Dedicated): Custom scaling based on cluster configuration. `You can scale in/out manually or via support ticket.`

E.g., if you're ingesting large volumes (e.g., 15 MB/s), you'd need:

- **15 TUs** in the Standard tier (if evenly distributed).
- Or **2–3 PUs** in the Premium tier for smoother performance and lower cost.
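As a quick sanity check on sizing, here is a minimal sketch (plain Python, no SDK required) that turns an ingress rate into a unit count, using the approximate per-unit rates from the table above:

```python
import math

# Approximate per-unit ingress rates from the tier table above.
TU_INGRESS_MBPS = 1.0   # Basic/Standard: ~1 MB/s per TU
PU_INGRESS_MBPS = 5.0   # Premium: conservative end of the ~5-10 MB/s range

def required_units(ingress_mbps: float, per_unit_mbps: float) -> int:
    """Capacity is provisioned in whole units, so always round up."""
    return math.ceil(ingress_mbps / per_unit_mbps)

ingress = 15.0  # MB/s, the example workload above
print(f"Standard tier: {required_units(ingress, TU_INGRESS_MBPS)} TUs")  # 15 TUs
print(f"Premium tier:  {required_units(ingress, PU_INGRESS_MBPS)} PUs")  # 3 PUs
```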
## Monitoring with Metrics

> If you want to spot gaps or delays in data flow with Azure Event Hubs, it’s a good idea to keep an eye on `some key metrics at each stage of the ingestion and consumption process.` These metrics can help you figure out whether there’s latency, throttling, or data loss happening, and whether it’s from the producers, inside Event Hubs, or when data moves to consumers or storage.

https://github.com/user-attachments/assets/2616c7fb-81b5-4365-9346-0332b91cc892

> [!TIP]
> - Incoming Requests → show `how much data is being sent.`
> - Successful Requests → confirm `it's being accepted.`
> - Incoming Messages → count the `actual events received.`
> - Throttled Requests → warn if `you're hitting limits.`
> - Outgoing Messages/Bytes → show `how much is being consumed.`
> - Capture Backlog → reveals if `data is stuck waiting to be stored.`
| **Metric** | **What It Tells You** | **Why It Matters for Delay Diagnosis** |
|------------|-----------------------|----------------------------------------|
| **Incoming Requests** | Number of API calls made to send data into Event Hubs. | High volume confirms producers are active. If low, the delay may be upstream. |
| **Successful Requests** | Requests that were accepted and processed by Event Hubs. | Confirms Event Hubs is not rejecting traffic. Low values suggest issues with authentication or limits. |
| **Throttled Requests** | Requests rejected due to throughput or quota limits. | High values indicate Event Hubs is overloaded, which can directly cause ingestion delays. |
| **Incoming Messages** | Number of events successfully received by Event Hubs. | Confirms actual data flow. If low despite high requests, payloads may be malformed or dropped. |
| **Outgoing Messages** | Number of events read by consumers (e.g., Splunk Connect). | If low compared to Incoming Messages, consumers may be lagging, misconfigured, or disconnected. |
| **Outgoing Bytes** | Total size of data read by consumers. | Helps assess payload size and bandwidth. Large payloads may slow delivery even if the message count is fine. |
| **Capture Backlog** | Volume of data waiting to be written to storage (Blob/Data Lake). | If using Capture, a growing backlog signals Event Hubs is falling behind or storage is misconfigured. |
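If you prefer pulling these numbers programmatically instead of from the portal, a minimal sketch with the `azure-monitor-query` and `azure-identity` Python packages could look like the following (the resource ID is a placeholder for your own namespace; the metric names are the standard Event Hubs metric names):

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder: the full Azure resource ID of your Event Hubs namespace.
RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.EventHub/namespaces/<namespace>"
)

client = MetricsQueryClient(DefaultAzureCredential())

response = client.query_resource(
    RESOURCE_ID,
    metric_names=[
        "IncomingRequests", "SuccessfulRequests", "ThrottledRequests",
        "IncomingMessages", "OutgoingMessages", "OutgoingBytes", "CaptureBacklog",
    ],
    timespan=timedelta(hours=1),                 # look back one hour
    granularity=timedelta(minutes=5),            # 5-minute buckets
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    total = sum(point.total or 0 for ts in metric.timeseries for point in ts.data)
    print(f"{metric.name}: {total:.0f} over the last hour")
```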
E.g.:

<img width="1901" height="838" alt="image" src="https://github.com/user-attachments/assets/54bf9b1e-c56c-4509-9716-a47848ce4359" />
```mermaid
graph TD
    A[Delay in Splunk ingestion]

    A --> B[Check Incoming Requests]
    B -->|= 0| B1[No data sent → Check producer]
    B -->|> 0| C[Check Incoming Messages]
    C -->|= 0| C1[Malformed payloads → Validate format]
    C -->|> 0| D[Check Throttled Requests]
    D -->|> 0| D1[Throttling → Scale throughput]
    D -->|= 0| E[Check Outgoing Messages]
    E -->|= 0| E1[No consumer → Restart Splunk Connect]
    E -->|< Incoming Messages| E2[Consumer lag → Tune consumer]
    E --> F[Check Capture Backlog]
    F -->|Increasing| F1[Capture delay → Check storage]
    F -->|= 0 but delay| F2[Consumer issue → Check Splunk pipeline]
```
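The same triage order can be scripted against the metric totals gathered above; here is a rough sketch of the decision tree in plain Python (the inputs are illustrative, not an official tool):

```python
def diagnose(incoming_requests: int, incoming_messages: int,
             throttled_requests: int, outgoing_messages: int,
             capture_backlog_growing: bool) -> str:
    """Walks the flowchart above and returns the first likely cause."""
    if incoming_requests == 0:
        return "No data sent -> check the producer"
    if incoming_messages == 0:
        return "Malformed payloads -> validate format"
    if throttled_requests > 0:
        return "Throttling -> scale throughput"
    if outgoing_messages == 0:
        return "No consumer -> restart Splunk Connect"
    if outgoing_messages < incoming_messages:
        return "Consumer lag -> tune the consumer"
    if capture_backlog_growing:
        return "Capture delay -> check storage"
    return "Consumer-side issue -> check the Splunk pipeline"

# Example: requests arrive and nothing is throttled, but consumers read
# far fewer events than arrived, so the likely cause is consumer lag.
print(diagnose(1200, 1150, 0, 300, False))
```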
https://github.com/user-attachments/assets/e84aee9c-6b2c-47c3-a0bf-0d25f9ecf0e7
<details>
<summary><strong>Incoming Requests = 0</strong></summary>

> - `Interpretation`: No data is being sent to Event Hubs. This typically means the upstream producer is inactive, misconfigured, or disconnected.
> - `Why This Matters`: If no requests are arriving, the delay is upstream, not within Event Hubs. This is the first checkpoint in the ingestion pipeline.
> - `Actions to Take`:
>   - Verify that producers are running and targeting the correct Event Hub.
>   - Check authentication credentials and network connectivity.
>   - Review producer logs for errors or dropped messages.
>   - Ensure DNS resolution and firewall rules allow outbound traffic.
</details>
<details>
<summary><strong>Incoming Requests > 0 but Incoming Messages = 0</strong></summary>

> - `Interpretation`: Requests are reaching Event Hubs, but no events are being accepted. This may indicate malformed payloads or schema mismatches.
> - `Why This Matters`: Traffic is arriving, but Event Hubs is unable to process it, likely due to formatting or validation issues.
> - `Actions to Take`:
>   - Validate payload structure and encoding.
>   - Check for serialization errors or schema mismatches.
>   - Ensure the producer SDK is compatible with Event Hubs.
>   - Review Event Hubs logs for dropped or rejected messages.
</details>
<details>
<summary><strong>Successful Requests < Incoming Requests</strong></summary>

> - `Interpretation`: Some requests are failing, likely due to quota limits, authentication errors, or SDK misconfigurations.
> - `Why This Matters`: Failed requests mean data is not entering the pipeline, which can result in partial ingestion or silent data loss.
> - `Actions to Take`:
>   - Review Azure Monitor logs for error codes.
>   - Check Event Hubs quota limits and authentication credentials.
>   - Ensure the SDK is up to date and properly configured.
>   - Monitor retry logic and backoff settings.
</details>
<details>
<summary><strong>Throttled Requests > 0</strong></summary>

> - `Interpretation`: Event Hubs is rejecting requests due to throughput limits being exceeded.
> - `Why This Matters`: Throttling causes ingestion delays and can lead to dropped messages if retries aren't configured.
> - `Actions to Take`:
>   - Scale up throughput units (TUs).
>   - Optimize producer batching and retry logic (see the sketch below).
>   - Consider upgrading to the Premium or Dedicated tier.
>   - Monitor partition distribution to avoid hotspots.
</details>
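On the batching and retry point, a minimal producer sketch with the `azure-eventhub` Python package might look like this (the connection string and hub name are placeholders; `retry_total` and `retry_backoff_factor` are client options, and batching reduces the request count so fewer calls hit the TU limit):

```python
from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "<namespace-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"               # placeholder

producer = EventHubProducerClient.from_connection_string(
    CONN_STR,
    eventhub_name=EVENT_HUB,
    retry_total=5,             # retry throttled/transient failures
    retry_backoff_factor=0.8,  # back off between attempts
)

events = [EventData(f"event {i}") for i in range(10_000)]

with producer:
    batch = producer.create_batch()
    pending = 0
    for event in events:
        try:
            batch.add(event)
            pending += 1
        except ValueError:  # batch is full: flush it, then start a new one
            producer.send_batch(batch)
            batch = producer.create_batch()
            batch.add(event)
            pending = 1
    if pending:
        producer.send_batch(batch)  # flush the final partial batch
```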
<details>
<summary><strong>Incoming Messages > Outgoing Messages</strong></summary>

> - `Interpretation`: Consumers are lagging behind. This could be due to slow processing, misconfiguration, or disconnection.
> - `Why This Matters`: Data accumulates in Event Hubs and delays downstream ingestion. This can lead to increased costs or message expiration.
> - `Actions to Take`:
>   - Check consumer health and retry settings.
>   - Monitor partition lag (see the sketch below) and scale out consumer instances.
>   - Review consumer logs for processing delays.
>   - Ensure checkpointing is functioning correctly.
</details>
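One way to eyeball per-partition lag is to compare each partition's last enqueued sequence number against the position your consumer last checkpointed; a minimal sketch with the `azure-eventhub` package (connection string and hub name are placeholders):

```python
from azure.eventhub import EventHubConsumerClient

CONN_STR = "<namespace-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"               # placeholder

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=EVENT_HUB
)

with client:
    for pid in client.get_partition_ids():
        props = client.get_partition_properties(pid)
        # Lag for a partition = last_enqueued_sequence_number minus the
        # sequence number your consumer last checkpointed for it (read that
        # from your checkpoint store); a growing gap means the consumer is
        # falling behind.
        print(
            f"partition {pid}: "
            f"last_enqueued_seq={props['last_enqueued_sequence_number']}, "
            f"last_enqueued_time={props['last_enqueued_time_utc']}"
        )
```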
<details>
<summary><strong>Outgoing Messages = 0</strong></summary>

> - `Interpretation`: No consumer is reading from Event Hubs. This may indicate a broken connection or an inactive consumer.
> - `Why This Matters`: If no one is reading, data will remain in Event Hubs and eventually expire or overflow.
> - `Actions to Take`:
>   - Restart consumer services and validate connection strings (a quick connectivity check is sketched below).
>   - Check logs for authentication or network errors.
>   - Ensure the consumer is subscribed to the correct Event Hub and partitions.
>   - Confirm that the consumer group is active and not blocked.
</details>
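As a quick smoke test that the hub is readable at all, a throwaway consumer like the sketch below (again `azure-eventhub`, with placeholder connection details) can rule out connection-string, hub-name, and consumer-group mistakes; any printed output proves events are reachable:

```python
from azure.eventhub import EventHubConsumerClient

CONN_STR = "<namespace-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"               # placeholder

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=EVENT_HUB
)

def on_event(partition_context, event):
    # Seeing output here confirms the connection string, hub name, and
    # consumer group are valid and that events are readable.
    print(f"partition {partition_context.partition_id}: {event.body_as_str()}")

with client:
    # starting_position="-1" reads each partition from the beginning.
    client.receive(on_event=on_event, starting_position="-1")
```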
<details>
<summary><strong>Outgoing Bytes unusually high per message</strong></summary>

> - `Interpretation`: Payloads are large, which may slow down delivery due to bandwidth constraints or inefficient serialization.
> - `Why This Matters`: Large payloads can saturate the egress pipeline and delay consumer processing.
> - `Actions to Take`:
>   - Compress payloads or reduce message size (see the sketch below).
>   - Use efficient serialization formats.
>   - Increase consumer capacity or parallelism.
>   - Monitor egress bandwidth and adjust the partitioning strategy.
</details>
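Event Hubs doesn't compress payloads for you, so compression is an application-level convention between your producers and consumers. A minimal sketch of that pattern (placeholder connection details; the `content_encoding` property name is just an illustrative convention the consumer must agree on):

```python
import gzip
import json

from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "<namespace-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"               # placeholder

record = {"host": "web-01", "message": "repetitive log line " * 200}
raw = json.dumps(record).encode("utf-8")
compressed = gzip.compress(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes")

event = EventData(compressed)
# Tag the event so consumers know to gzip-decompress before parsing;
# Event Hubs itself treats the body as opaque bytes.
event.properties = {"content_encoding": "gzip"}

producer = EventHubProducerClient.from_connection_string(
    CONN_STR, eventhub_name=EVENT_HUB
)
with producer:
    batch = producer.create_batch()
    batch.add(event)
    producer.send_batch(batch)
```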
<details>
<summary><strong>Capture Backlog increasing over time</strong></summary>

> - `Interpretation`: Event Hubs is falling behind on writing to Blob Storage or Data Lake.
> - `Why This Matters`: A growing backlog indicates that data is stuck in Event Hubs and not being flushed to storage.
> - `Actions to Take`:
>   - Check storage account health and write permissions.
>   - Review the capture configuration and partition mapping.
>   - Monitor Event Hubs throughput and scale if needed.
>   - Validate that the capture destination is reachable.
</details>
<details>
<summary><strong>Capture Backlog = 0 but Outgoing Messages still delayed</strong></summary>

> - `Interpretation`: Event Hubs is writing to storage successfully, but consumers are still slow.
> - `Why This Matters`: This isolates the delay to the consumer side, not Event Hubs or storage.
> - `Actions to Take`:
>   - Investigate consumer performance and ingestion rate.
>   - Check for downstream bottlenecks or queue saturation.
>   - Review Splunk Connect logs and retry behavior.
</details>
## Support and possible considerations
<details>
<summary><strong>Low throughput despite normal metrics</strong></summary>

> - `Interpretation`: Event Hubs may be under-provisioned for your workload.
> - `Why This Matters`: Under-provisioning can mask performance issues and lead to inconsistent delivery rates.
> - `Actions to Take`:
>   - Review throughput unit (TU) allocation.
>   - Consider scaling up or moving to the Premium/Dedicated tier.
>   - Check the partition count and ensure load is evenly distributed.
</details>
<details>
<summary><strong>Region mismatch between Event Hubs and consumers</strong></summary>

> - `Interpretation`: If Event Hubs and consumers are in different Azure regions, network latency can introduce delays.
> - `Why This Matters`: Cross-region traffic increases latency and can degrade performance.
> - `Actions to Take`:
>   - Align Event Hubs and consumer services to the same region.
>   - Use Azure Network Watcher to measure latency.
>   - Consider ExpressRoute or private endpoints for critical workloads.
</details>
<details>
<summary><strong>Partition imbalance</strong></summary>

> - `Interpretation`: One partition receives significantly more traffic than others, creating a bottleneck.
> - `Why This Matters`: Uneven partition load can lead to throttling, consumer lag, and inefficient resource usage.
> - `Actions to Take`:
>   - Use a custom partitioning strategy or round-robin (see the sketch below).
>   - Monitor per-partition metrics.
>   - Rebalance producers and ensure consumers are evenly distributed.
</details>
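For the partitioning strategy, a minimal `azure-eventhub` producer sketch showing both options (placeholder connection details; `device-42` is just an illustrative key):

```python
from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "<namespace-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"               # placeholder

producer = EventHubProducerClient.from_connection_string(
    CONN_STR, eventhub_name=EVENT_HUB
)

with producer:
    # Option 1: no partition key, so the service spreads events across
    # partitions round-robin; this is the safest default for even load.
    batch = producer.create_batch()
    batch.add(EventData("round-robin event"))
    producer.send_batch(batch)

    # Option 2: a high-cardinality partition key (e.g., a device ID) keeps
    # related events ordered within one partition; low-cardinality keys
    # (e.g., one tenant dominating traffic) are what create hotspots.
    keyed = producer.create_batch(partition_key="device-42")
    keyed.add(EventData("keyed event"))
    producer.send_batch(keyed)
```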
<details>
<summary><strong>Consumer checkpointing is delayed</strong></summary>

> - `Interpretation`: Consumers aren’t checkpointing frequently, which may cause reprocessing or lag.
> - `Why This Matters`: Delayed checkpointing increases latency and can cause duplicate processing.
> - `Actions to Take`:
>   - Review the checkpointing interval and logic (see the sketch below).
>   - Ensure the storage used for checkpoints is healthy.
>   - Monitor EventProcessorClient logs.
>   - Validate that checkpointing is enabled and functioning.
</details>
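A minimal checkpointing sketch with the `azure-eventhub` and `azure-eventhub-checkpointstoreblob` Python packages (all connection details are placeholders; checkpointing per event is the simplest policy, while checkpointing every N events amortizes the blob writes):

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

EH_CONN_STR = "<namespace-connection-string>"      # placeholder
STORAGE_CONN_STR = "<storage-connection-string>"   # placeholder
EVENT_HUB = "<event-hub-name>"                     # placeholder

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN_STR, container_name="checkpoints"
)

client = EventHubConsumerClient.from_connection_string(
    EH_CONN_STR,
    consumer_group="$Default",
    eventhub_name=EVENT_HUB,
    checkpoint_store=checkpoint_store,  # enables durable progress tracking
)

def on_event(partition_context, event):
    # ... process the event here ...
    # Checkpoint often enough that a restart doesn't replay a large backlog.
    partition_context.update_checkpoint(event)

with client:
    client.receive(on_event=on_event, starting_position="-1")
```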
<details>
<summary><strong>Consumer SDK version is outdated</strong></summary>

> - `Interpretation`: Older SDKs may have inefficient polling, retry logic, or lack performance features.
> - `Why This Matters`: Using outdated SDKs can introduce latency and reduce reliability.
> - `Actions to Take`:
>   - Upgrade to the latest Event Hubs SDK.
>   - Review changelogs for performance improvements.
>   - Validate retry and prefetch settings.
>   - Test ingestion performance after the upgrade.
</details>
<details>
<summary><strong>Capture is enabled but unused</strong></summary>

> - `Interpretation`: Event Hubs Capture is turned on, but the data is not being consumed or stored downstream. This can lead to unnecessary resource usage and potential confusion in diagnostics.
> - `Why This Matters`: Capture consumes throughput and storage resources. If it's enabled but unused, it may affect performance or cost without delivering value.
> - `Actions to Take`:
>   - Review whether Capture is needed for your workload.
>   - Disable Capture if it's not actively used for analytics or archival.
>   - Check the storage account configuration to ensure it's not misconfigured or unreachable.
>   - Monitor Capture metrics to confirm data is being written and accessed.
</details>
<!-- START BADGE -->
<div align="center">
  <img src="https://img.shields.io/badge/Total%20views-1443-limegreen" alt="Total views">
  <p>Refresh Date: 2025-09-05</p>
</div>
<!-- END BADGE -->