# Azure Event Hubs: Performance Monitoring - Overview

Costa Rica

[brown9804](https://github.com/brown9804)

Last updated: 2025-07-17

----------

<details>
<summary><b>List of References</b> (Click to expand)</summary>

- [Azure Event Hubs quotas and limits](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas#basic-vs-standard-vs-premium-vs-dedicated-tiers)

</details>

## Tiers

> Here is a list of quotas and limits for each Event Hubs tier: [Basic vs. standard vs. premium vs. dedicated tiers](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas#basic-vs-standard-vs-premium-vs-dedicated-tiers)

| **Tier** | **Ingress (Send)** | **Egress (Receive)** | **Scalability & Notes** |
|----------|--------------------|----------------------|-------------------------|
| **Basic** | ~1 MB/s or 1,000 events/sec per TU | ~2 MB/s or 4,096 events/sec per TU | Max 20 TUs per namespace. No Capture feature. Ideal for small workloads. |
| **Standard** | Same as Basic | Same as Basic | Includes Capture. Max 40 TUs per namespace. Supports scaling via TUs. |
| **Premium** | ~5–10 MB/s ingress per PU | ~10–20 MB/s egress per PU | Uses Processing Units (PUs). Dedicated resources. Scales by adding PUs. |
| **Dedicated** | Fully customizable (via Capacity Units) | Fully customizable (via Capacity Units) | Enterprise-grade. Scales via Capacity Units (CUs). Ideal for massive, mission-critical workloads. |

> [!NOTE]
> - **TU = Throughput Unit** (Basic/Standard): each TU gives you ~1 MB/s ingress and ~2 MB/s egress.
> - **PU = Processing Unit** (Premium): each PU offers ~5–10 MB/s ingress and ~10–20 MB/s egress, depending on partition count and consumer efficiency.
> - **CU = Capacity Unit** (Dedicated): custom scaling based on cluster configuration. You can scale in/out manually or via support ticket.

E.g.:

> If you're ingesting large volumes (e.g., 15 MB/s), you'd need:
> - **15 TUs** in the Standard tier (if load is evenly distributed).
> - Or **2–3 PUs** in the Premium tier for smoother performance and potentially lower cost.

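The sizing arithmetic above is easy to script. This is a minimal sketch using the approximate per-unit ingress rates from the tier table; the constants are rough planning figures, not service guarantees:

```python
import math

# Approximate ingress capacity per scaling unit (MB/s), taken from the
# tier table above. Real limits also depend on partitions and payloads.
INGRESS_MB_PER_TU = 1   # Basic/Standard throughput unit
INGRESS_MB_PER_PU = 5   # Premium processing unit (conservative end of ~5-10)

def required_units(target_mb_per_s: float, mb_per_unit: float) -> int:
    """Round up to the number of scaling units needed for a target ingress rate."""
    return math.ceil(target_mb_per_s / mb_per_unit)

# A 15 MB/s workload needs 15 TUs on Standard, or 3 PUs on Premium
# at the conservative ~5 MB/s-per-PU estimate.
print(required_units(15, INGRESS_MB_PER_TU))  # 15
print(required_units(15, INGRESS_MB_PER_PU))  # 3
```

At the higher ~10 MB/s-per-PU estimate the same workload fits in 2 PUs, which is where the 2–3 PU range above comes from.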
## Monitoring with Metrics

> If you want to spot gaps or delays in data flow with Azure Event Hubs, it's a good idea to keep an eye on a few key metrics at each stage of the
> ingestion and consumption process. These metrics can help you figure out whether there's latency, throttling,
> or data loss happening, and whether it's coming from the producers, inside Event Hubs, or when data moves to consumers or storage.

https://github.com/user-attachments/assets/2616c7fb-81b5-4365-9346-0332b91cc892

> [!TIP]
> - Incoming Requests → show how much data is being sent.
> - Successful Requests → confirm it's being accepted.
> - Incoming Messages → count the actual events received.
> - Throttled Requests → warn if you're hitting limits.
> - Outgoing Messages/Bytes → show how much is being consumed.
> - Capture Backlog → reveals if data is stuck waiting to be stored.

| **Metric** | **What It Tells You** | **Why It Matters for Delay Diagnosis** |
|------------|-----------------------|----------------------------------------|
| **Incoming Requests** | Number of API calls made to send data into Event Hubs. | High volume confirms producers are active. If low, the delay may be upstream. |
| **Successful Requests** | Requests that were accepted and processed by Event Hubs. | Confirms Event Hubs is not rejecting traffic. Low values suggest issues with authentication or limits. |
| **Throttled Requests** | Requests rejected due to throughput or quota limits. | High values indicate Event Hubs is overloaded, which can directly cause ingestion delays. |
| **Incoming Messages** | Number of events successfully received by Event Hubs. | Confirms actual data flow. If low despite high requests, payloads may be malformed or dropped. |
| **Outgoing Messages** | Number of events read by consumers (e.g., Splunk Connect). | If low compared to Incoming Messages, consumers may be lagging, misconfigured, or disconnected. |
| **Outgoing Bytes** | Total size of data read by consumers. | Helps assess payload size and bandwidth. Large payloads may slow delivery even if message count is fine. |
| **Capture Backlog** | Volume of data waiting to be written to storage (Blob/Data Lake). | If using Capture, a growing backlog signals Event Hubs is falling behind or storage is misconfigured. |

E.g.:

<img width="1901" height="838" alt="image" src="https://github.com/user-attachments/assets/54bf9b1e-c56c-4509-9716-a47848ce4359" />

```mermaid
graph TD
    A[Delay in Splunk ingestion]

    A --> B[Check Incoming Requests]
    B -->|= 0| B1[No data sent → Check producer]
    B -->|> 0| C[Check Incoming Messages]
    C -->|= 0| C1[Malformed payloads → Validate format]
    C -->|> 0| D[Check Throttled Requests]
    D -->|> 0| D1[Throttling → Scale throughput]
    D -->|= 0| E[Check Outgoing Messages]
    E -->|= 0| E1[No consumer → Restart Splunk Connect]
    E -->|< Incoming Messages| E2[Consumer lag → Tune consumer]
    E --> F[Check Capture Backlog]
    F -->|Increasing| F1[Capture delay → Check storage]
    F -->|= 0 but delay| F2[Consumer issue → Check Splunk pipeline]
```
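The same triage order can be expressed as a small helper function. This is only a sketch: the checks mirror the flowchart, and the returned strings are illustrative labels, not anything emitted by Azure:

```python
def diagnose(incoming_requests: int, incoming_messages: int,
             throttled_requests: int, outgoing_messages: int,
             capture_backlog: int) -> str:
    """Walk the flowchart's checks in order and return the first likely cause."""
    if incoming_requests == 0:
        return "No data sent -> check producer"
    if incoming_messages == 0:
        return "Malformed payloads -> validate format"
    if throttled_requests > 0:
        return "Throttling -> scale throughput"
    if outgoing_messages == 0:
        return "No consumer -> restart Splunk Connect"
    if outgoing_messages < incoming_messages:
        return "Consumer lag -> tune consumer"
    if capture_backlog > 0:
        return "Capture delay -> check storage"
    return "Consumer issue -> check Splunk pipeline"

# Requests and messages arrive, nothing is throttled, but consumers
# have only read 400 of 1000 events:
print(diagnose(1000, 1000, 0, 400, 0))  # Consumer lag -> tune consumer
```

You could feed this with values pulled from Azure Monitor over a fixed time window; in practice compare rates over the same interval rather than raw lifetime totals.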

https://github.com/user-attachments/assets/e84aee9c-6b2c-47c3-a0bf-0d25f9ecf0e7

<details>
<summary><strong>Incoming Requests = 0</strong></summary>

> - `Interpretation`: No data is being sent to Event Hubs. This typically means the upstream producer is inactive, misconfigured, or disconnected.
> - `Why This Matters`: If no requests are arriving, the delay is upstream, not within Event Hubs. This is the first checkpoint in the ingestion pipeline.
> - `Actions to Take`:
>   - Verify that producers are running and targeting the correct Event Hub.
>   - Check authentication credentials and network connectivity.
>   - Review producer logs for errors or dropped messages.
>   - Ensure DNS resolution and firewall rules allow outbound traffic.
</details>

<details>
<summary><strong>Incoming Requests > 0 but Incoming Messages = 0</strong></summary>

> - `Interpretation`: Requests are reaching Event Hubs, but no events are being accepted. This may indicate malformed payloads or schema mismatches.
> - `Why This Matters`: Traffic is arriving, but Event Hubs is unable to process it, likely due to formatting or validation issues.
> - `Actions to Take`:
>   - Validate payload structure and encoding.
>   - Check for serialization errors or schema mismatches.
>   - Ensure the producer SDK is compatible with Event Hubs.
>   - Review Event Hubs logs for dropped or rejected messages.
</details>

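On the producer side, a cheap pre-send validation step can catch malformed payloads before they ever reach Event Hubs. A minimal sketch, assuming JSON events with a required `timestamp` field; the field name and size limit here are illustrative, not an Event Hubs schema:

```python
import json

MAX_EVENT_BYTES = 1_000_000  # stay safely under the ~1 MB per-event limit

def validate_event(payload: str) -> bool:
    """Return True if the payload is valid JSON, carries the expected field,
    and fits within the size limit. Purely a producer-side sanity check."""
    if len(payload.encode("utf-8")) > MAX_EVENT_BYTES:
        return False
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return False
    # Hypothetical contract: every event is an object with a timestamp.
    return isinstance(event, dict) and "timestamp" in event

print(validate_event('{"timestamp": "2025-07-17T00:00:00Z", "level": "info"}'))  # True
print(validate_event("not json"))  # False
```

Logging and counting rejected payloads at this stage makes the "high Incoming Requests, zero Incoming Messages" pattern much easier to explain later.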
<details>
<summary><strong>Successful Requests < Incoming Requests</strong></summary>

> - `Interpretation`: Some requests are failing, likely due to quota limits, authentication errors, or SDK misconfigurations.
> - `Why This Matters`: Failed requests mean data is not entering the pipeline, which can result in partial ingestion or silent data loss.
> - `Actions to Take`:
>   - Review Azure Monitor logs for error codes.
>   - Check Event Hubs quota limits and authentication credentials.
>   - Ensure the SDK is up to date and properly configured.
>   - Monitor retry logic and backoff settings.
</details>

<details>
<summary><strong>Throttled Requests > 0</strong></summary>

> - `Interpretation`: Event Hubs is rejecting requests due to throughput limits being exceeded.
> - `Why This Matters`: Throttling causes ingestion delays and can lead to dropped messages if retries aren't configured.
> - `Actions to Take`:
>   - Scale up throughput units (TUs).
>   - Optimize producer batching and retry logic.
>   - Consider upgrading to Premium or Dedicated tier.
>   - Monitor partition distribution to avoid hotspots.
</details>

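The Event Hubs SDKs ship their own retry policies, but the back-off idea they implement looks roughly like this generic sketch; `send` is a placeholder callable and `RuntimeError` stands in for a throttling/server-busy error, neither is an Event Hubs API:

```python
import random
import time

def send_with_backoff(send, batch, max_attempts=5, base_delay=0.5):
    """Retry a throttled send with exponential backoff plus jitter.
    `send` is any callable that raises on throttling (placeholder here)."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except RuntimeError:  # stand-in for a throttling error
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Double the wait each time; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage: wrap whatever your producer's send call is.
# send_with_backoff(producer_send, event_batch)
```

Backing off matters because retrying a throttled namespace immediately just burns more of the quota that caused the throttling in the first place.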
<details>
<summary><strong>Incoming Messages > Outgoing Messages</strong></summary>

> - `Interpretation`: Consumers are lagging behind. This could be due to slow processing, misconfiguration, or disconnection.
> - `Why This Matters`: Data accumulates in Event Hubs and delays downstream ingestion. This can lead to increased costs or message expiration.
> - `Actions to Take`:
>   - Check consumer health and retry settings.
>   - Monitor partition lag and scale out consumer instances.
>   - Review consumer logs for processing delays.
>   - Ensure checkpointing is functioning correctly.
</details>

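Per-partition lag is simply the gap between what each partition has received and what the consumer group has read. A minimal sketch with made-up metric values, useful for spotting which partition to investigate first:

```python
def partition_lag(incoming: dict, outgoing: dict) -> dict:
    """Compute per-partition backlog from incoming/outgoing message counts."""
    return {p: incoming[p] - outgoing.get(p, 0) for p in incoming}

# Hypothetical per-partition counts over the same time window:
incoming = {"0": 5000, "1": 5000, "2": 5000}
outgoing = {"0": 5000, "1": 3200, "2": 4900}

lag = partition_lag(incoming, outgoing)
print(lag)  # {'0': 0, '1': 1800, '2': 100}

hot = max(lag, key=lag.get)  # partition '1' is the one to investigate
```

If lag is concentrated on one partition, suspect a skewed partition key or a stuck consumer instance; if it is spread evenly, the consumer fleet as a whole is undersized.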
<details>
<summary><strong>Outgoing Messages = 0</strong></summary>

> - `Interpretation`: No consumer is reading from Event Hubs. This may indicate a broken connection or inactive consumer.
> - `Why This Matters`: If no one is reading, data will remain in Event Hubs and eventually expire or overflow.
> - `Actions to Take`:
>   - Restart consumer services and validate connection strings.
>   - Check logs for authentication or network errors.
>   - Ensure the consumer is subscribed to the correct Event Hub and partitions.
>   - Confirm that the consumer group is active and not blocked.
</details>

<details>
<summary><strong>Outgoing Bytes unusually high per message</strong></summary>

> - `Interpretation`: Payloads are large, which may slow down delivery due to bandwidth constraints or inefficient serialization.
> - `Why This Matters`: Large payloads can saturate the egress pipeline and delay consumer processing.
> - `Actions to Take`:
>   - Compress payloads or reduce message size.
>   - Use efficient serialization formats.
>   - Increase consumer capacity or parallelism.
>   - Monitor egress bandwidth and adjust partitioning strategy.
</details>

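When bytes-per-message is high, compressing payloads before sending is often the cheapest fix, provided your consumers can decompress on the other end (an assumption to verify before adopting this). A sketch using gzip over JSON with a deliberately repetitive payload:

```python
import gzip
import json

# Hypothetical log-style event; repetitive text compresses very well.
event = {"source": "app01", "entries": ["GET /health 200 0.8ms"] * 200}
raw = json.dumps(event).encode("utf-8")
compressed = gzip.compress(raw)

# Compare wire sizes; the ratio depends entirely on payload content.
print(len(raw), len(compressed))

assert gzip.decompress(compressed) == raw  # round-trips losslessly
```

Compact binary formats (e.g., Avro or Protobuf instead of verbose JSON) attack the same problem at the serialization layer rather than after it.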
<details>
<summary><strong>Capture Backlog increasing over time</strong></summary>

> - `Interpretation`: Event Hubs is falling behind on writing to Blob Storage or Data Lake.
> - `Why This Matters`: A growing backlog indicates that data is stuck in Event Hubs and not being flushed to storage.
> - `Actions to Take`:
>   - Check storage account health and write permissions.
>   - Review capture configuration and partition mapping.
>   - Monitor Event Hubs throughput and scale if needed.
>   - Validate that the capture destination is reachable.
</details>

<details>
<summary><strong>Capture Backlog = 0 but Outgoing Messages still delayed</strong></summary>

> - `Interpretation`: Event Hubs is writing to storage successfully, but consumers are still slow.
> - `Why This Matters`: This isolates the delay to the consumer side, not Event Hubs or storage.
> - `Actions to Take`:
>   - Investigate consumer performance and ingestion rate.
>   - Check for downstream bottlenecks or queue saturation.
>   - Review Splunk Connect logs and retry behavior.
</details>

## Support and possible considerations

<details>
<summary><strong>Low throughput despite normal metrics</strong></summary>

> - `Interpretation`: Event Hubs may be under-provisioned for your workload.
> - `Why This Matters`: This condition can mask performance issues and lead to inconsistent delivery rates.
> - `Actions to Take`:
>   - Review throughput unit (TU) allocation.
>   - Consider scaling up or moving to Premium/Dedicated tier.
>   - Check partition count and ensure load is evenly distributed.
</details>

<details>
<summary><strong>Region mismatch between Event Hubs and consumers</strong></summary>

> - `Interpretation`: If Event Hubs and consumers are in different Azure regions, network latency can introduce delays.
> - `Why This Matters`: Cross-region traffic increases latency and can degrade performance.
> - `Actions to Take`:
>   - Align Event Hubs and consumer services to the same region.
>   - Use Azure Network Watcher to measure latency.
>   - Consider ExpressRoute or private endpoints for critical workloads.
</details>

<details>
<summary><strong>Partition imbalance</strong></summary>

> - `Interpretation`: One partition receives significantly more traffic than others, creating a bottleneck.
> - `Why This Matters`: Uneven partition load can lead to throttling, consumer lag, and inefficient resource usage.
> - `Actions to Take`:
>   - Use a custom partitioning strategy or round-robin.
>   - Monitor per-partition metrics.
>   - Rebalance producers and ensure consumers are evenly distributed.
</details>

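The round-robin vs. key-hash trade-off can be sketched as below. The SDKs do the equivalent internally when you send without a key or with a partition key, so this is only to illustrate why one dominant key creates a hot partition (the hash function and tenant names are illustrative):

```python
import itertools
from zlib import crc32

PARTITIONS = 4
_rr = itertools.cycle(range(PARTITIONS))

def round_robin_partition() -> int:
    """Spread events evenly regardless of content (no ordering guarantees)."""
    return next(_rr)

def key_hash_partition(key: str) -> int:
    """Keep events with the same key on the same partition (preserves per-key
    ordering), at the risk of hotspots when one key dominates."""
    return crc32(key.encode("utf-8")) % PARTITIONS

# One dominant key sends almost all traffic to a single partition:
keys = ["tenant-a"] * 97 + ["tenant-b", "tenant-c", "tenant-d"]
counts = [0] * PARTITIONS
for k in keys:
    counts[key_hash_partition(k)] += 1
print(counts)  # heavily skewed toward tenant-a's partition
```

If per-key ordering is not required, dropping the key (letting the service round-robin) is usually the simplest way to rebalance load.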
<details>
<summary><strong>Consumer checkpointing is delayed</strong></summary>

> - `Interpretation`: Consumers aren’t checkpointing frequently, which may cause reprocessing or lag.
> - `Why This Matters`: Delayed checkpointing increases latency and can cause duplicate processing.
> - `Actions to Take`:
>   - Review checkpointing interval and logic.
>   - Ensure storage used for checkpoints is healthy.
>   - Monitor EventProcessorClient logs.
>   - Validate that checkpointing is enabled and functioning.
</details>

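Checkpointing on every event is expensive, so a common pattern is to checkpoint every N events or T seconds, whichever comes first. A generic sketch in which `save` stands in for whatever checkpoint store you use (not the SDK's own API):

```python
import time

class CheckpointThrottle:
    """Invoke `save(offset)` at most every `max_events` events or `max_secs` seconds."""

    def __init__(self, save, max_events=100, max_secs=10.0):
        self.save = save              # e.g. a checkpoint-store write (placeholder)
        self.max_events = max_events
        self.max_secs = max_secs
        self.pending = 0
        self.last = time.monotonic()

    def record(self, offset):
        """Call once per processed event; checkpoints only when a limit is hit."""
        self.pending += 1
        if self.pending >= self.max_events or time.monotonic() - self.last >= self.max_secs:
            self.save(offset)
            self.pending = 0
            self.last = time.monotonic()

# Processing 25 events with max_events=10 checkpoints at offsets 9 and 19.
saved = []
throttle = CheckpointThrottle(saved.append, max_events=10, max_secs=60.0)
for offset in range(25):
    throttle.record(offset)
print(saved)  # [9, 19]
```

The time bound keeps a slow stream from going unchecked for long stretches; the count bound caps how much work is replayed after a restart.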
<details>
<summary><strong>Consumer SDK version is outdated</strong></summary>

> - `Interpretation`: Older SDKs may have inefficient polling, retry logic, or lack performance features.
> - `Why This Matters`: Using outdated SDKs can introduce latency and reduce reliability.
> - `Actions to Take`:
>   - Upgrade to the latest Event Hubs SDK.
>   - Review changelogs for performance improvements.
>   - Validate retry and prefetch settings.
>   - Test ingestion performance after upgrade.
</details>

<details>
<summary><strong>Capture is enabled but unused</strong></summary>

> - `Interpretation`: Event Hubs Capture is turned on, but the data is not being consumed or stored downstream. This can lead to unnecessary resource usage and potential confusion in diagnostics.
> - `Why This Matters`: Capture consumes throughput and storage resources. If it's enabled but unused, it may affect performance or cost without delivering value.
> - `Actions to Take`:
>   - Review whether Capture is needed for your workload.
>   - Disable Capture if it's not actively used for analytics or archival.
>   - Check storage account configuration to ensure it's not misconfigured or unreachable.
>   - Monitor Capture metrics to confirm data is being written and accessed.
</details>


<!-- START BADGE -->
<div align="center">
  <img src="https://img.shields.io/badge/Total%20views-1443-limegreen" alt="Total views">
  <p>Refresh Date: 2025-09-05</p>
</div>
<!-- END BADGE -->