| Output Constraints | Use `max_tokens`, `stop` sequences, and `top_p` in Azure OpenAI API calls | Ensures clean, bounded outputs and avoids mid-sentence truncation |
| Monitoring & Scaling | Use Azure Monitor, Log Analytics, and PTUs for throughput and cost control | Enables observability and resilience at enterprise scale |

<details>
<summary><b> Token Budgeting in Azure </b> (Click to expand)</summary>

> Azure OpenAI models like GPT-4-128k enforce strict token limits. Complex documents with nested logic or rare terms can tokenize inefficiently, leading to unexpected truncation. Use an `Azure Function` or `Logic App` with the `tiktoken` library to analyze and split documents into token-aware chunks before sending them to Azure OpenAI.

> E.g: A 5-page technical document with nested conditions and domain-specific terms may tokenize into 40,000+ tokens, well beyond expectations, because of how `tiktoken` breaks down compound words and logic-heavy phrasing.

> **Why It Matters**

- Token limits are hard constraints in Azure OpenAI. If your prompt + completion exceeds the model’s token ceiling (e.g., 128k for GPT-4-128k), the request will fail or be truncated.
- Token inflation from complex syntax can cause:
  - Incomplete outputs
  - Higher latency
  - Increased cost per request
- Without token budgeting, you risk unpredictable behavior in production, especially in stateless or high-throughput applications.

> **How to Address**

- Use `tiktoken` in an Azure-native preprocessing pipeline to **count and manage tokens**:
  - Deploy an **Azure Function** that:
    - Accepts raw document input (from Blob Storage, SharePoint, etc.)
    - Uses `tiktoken` to calculate token count
    - Splits content into ≤3000-token chunks (or your preferred threshold)
    - Returns token-safe chunks to Azure OpenAI or Power Automate

Example Python snippet:

```python
import tiktoken

# Illustrative input; in practice this is the text extracted from the uploaded document.
your_text = "Paste or load the document text here."

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(your_text)
print(len(tokens))  # Use this to enforce chunking logic
```
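
Building on that count, here is a minimal sketch of the splitting step itself. It assumes plain-text input and a paragraph-based packing strategy; the `chunk_by_tokens` helper, the 3,000-token threshold, and the paragraph delimiter are illustrative choices, not a prescribed implementation.

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-4") -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    chunks, current, current_tokens = [], [], 0

    for paragraph in text.split("\n\n"):
        n = len(enc.encode(paragraph))
        # Note: a single paragraph longer than max_tokens still becomes its own
        # oversized chunk; a real pipeline would split it further (e.g., by sentence).
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

An Azure Function built this way could return `chunk_by_tokens(document_text)` to Power Automate, which then calls Azure OpenAI once per chunk.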

- Integrate with **Power Automate**:
  - Trigger the Azure Function from a flow
  - Loop through returned chunks and call Azure OpenAI for each
  - Aggregate responses if needed (e.g., for summarization or document Q&A); see the sketch after this list

- Use **Azure AI Search** for chunking + embedding:
  - Apply the **Text Split skill** with `maximumPageLength` and `pageOverlapLength` to control chunk size and preserve context.
  - Combine with the **Document Layout skill** to chunk by paragraphs or sections.
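
As a rough sketch of the "loop and aggregate" pattern referenced above, the snippet below calls Azure OpenAI once per chunk and then combines the partial results. It uses the `openai` Python SDK's `AzureOpenAI` client; the environment variables, API version, deployment name, and prompts are placeholders to replace with your own values.

```python
import os
from openai import AzureOpenAI

# Placeholder configuration; supply your own endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def summarize_chunks(chunks: list[str], deployment: str = "gpt-4") -> str:
    """Summarize each token-safe chunk, then merge the partial summaries."""
    partials = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=deployment,  # the Azure *deployment* name, not the model family
            messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}],
            max_tokens=300,
        )
        partials.append(response.choices[0].message.content)

    # Aggregation step (map-reduce style): summarize the summaries.
    merged = "\n\n".join(partials)
    final = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": f"Combine these section summaries into one coherent summary:\n\n{merged}"}],
        max_tokens=400,
    )
    return final.choices[0].message.content
```

In a Power Automate flow the same pattern applies: loop over the returned chunks, call the Azure OpenAI connector for each, then run one final call over the collected outputs.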

> **Monitoring**

- Use **Azure Monitor** and **Log Analytics** to track:
  - `tokens_used` per request
  - `flowRunId` and `request_uri` for traceability
  - `completion_tokens` vs `prompt_tokens` to optimize prompt design

- Visualize token trends in **Power BI**:
  - Identify documents or use cases with high token consumption
  - Detect anomalies like sudden spikes or frequent truncation errors

- Set alerts for:
  - Token usage thresholds
  - Truncation-related errors
  - Latency spikes due to oversized prompts

</details>

<details>
<summary><b> Semantic Chunking with Azure AI Search </b> (Click to expand)</summary>

> Azure `AI Search` supports semantic chunking via built-in skills like `Document Layout` and `Text Split`. These tools preserve logical structure and improve retrieval quality for RAG pipelines. `Chunking is not just about staying under token limits; it also improves embedding quality and relevance scoring.`
> Click here to read more about [Chunk large documents for vector search solutions in Azure AI Search](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents)

> E.g: A 10-page policy document with multiple sections and subheadings can be chunked by paragraph or heading using the Document Layout skill. This ensures each chunk contains a coherent idea, improving both retrieval accuracy and LLM comprehension.

> **Why It Matters**

- Fixed-size chunking (e.g., every 1,024 tokens) can split content mid-sentence or mid-thought, leading to:
  - Poor embedding quality
  - Irrelevant or incoherent retrieval results
  - Increased token usage due to repeated context
- Semantic chunking ensures each chunk is logically complete, which:
  - Improves the quality of vector embeddings
  - Increases the relevance of search results in RAG pipelines
  - Reduces the likelihood of truncation or hallucination in LLM responses

> **How to Apply in Azure**

- Use the **Document Layout skill** to chunk by:
  - Paragraphs
  - Headings (e.g., Markdown, HTML, or PDF structure)
  - Tables or sections
  - This skill extracts structural elements from documents and enables chunking based on layout rather than raw text length.

- Use the **Text Split skill** to:
  - Split by sentence, character count, or page length
  - Add 10–15% overlap between chunks to preserve context across boundaries
  - This is especially useful when layout metadata is unavailable or inconsistent

> **Example Configuration**

```json
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "textSplitMode": "pages",
  "maximumPageLength": 2000,
  "pageOverlapLength": 250,
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "textItems", "targetName": "pages" }
  ]
}
```

- Combine both skills in a skillset pipeline:
  - First, apply the Document Layout skill to extract structured content
  - Then, apply the Text Split skill to chunk the structured content into token-efficient, semantically meaningful segments
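
For reference, below is a minimal sketch of the Text Split half of such a pipeline using the `azure-search-documents` Python SDK. The service endpoint, skillset name, and field paths are placeholders; `page_overlap_length` follows the 10–15% overlap guidance and requires a recent SDK version; and in a full pipeline a Document Layout (structure-extraction) step would run before this skill.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SearchIndexerSkillset,
    SplitSkill,
)

# Placeholder endpoint; point this at your Azure AI Search service.
client = SearchIndexerClient(
    endpoint="https://<your-search-service>.search.windows.net",
    credential=DefaultAzureCredential(),
)

# Text Split skill: page-based chunks with overlap to preserve context across boundaries.
split_skill = SplitSkill(
    description="Chunk extracted content into token-friendly pages",
    text_split_mode="pages",
    maximum_page_length=2000,   # characters per chunk
    page_overlap_length=250,    # roughly 10-15% overlap
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="chunks")],
)

skillset = SearchIndexerSkillset(
    name="semantic-chunking-skillset",  # placeholder name
    description="Runs after structure extraction (e.g., Document Layout) in a full pipeline",
    skills=[split_skill],
)
client.create_or_update_skillset(skillset)
```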

> **Additional Tips**

- Store both the chunk content and its source metadata (e.g., heading, section title) in your Azure AI Search index. This improves filtering and reranking.
- Use chunk-level embeddings for vector search and rerank results using Azure AI Search’s semantic scoring.
- Evaluate chunking effectiveness by measuring:
  - Retrieval precision (e.g., top-k relevance)
  - LLM response quality
  - Token efficiency per query

</details>

<details>
<summary><b> Temperature & Output Control in Azure OpenAI </b> (Click to expand)</summary>

> High temperature values (e.g., 0.8–1.0) increase creativity but also verbosity, which can lead to token overflow. Lower values (e.g., 0.2–0.4) yield more concise, deterministic outputs. Combine temperature control with `top_p`, `stop` sequences, and `max_tokens` in your Azure OpenAI deployment or API call. Click here to read more about [What is Azure OpenAI in Azure AI Foundry Models?](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview)

> E.g: A prompt like “Summarize the following policy document” with a high temperature (0.9) might result in a long, overly creative response that includes speculative or unnecessary elaboration. The same prompt with a temperature of 0.2 will yield a more focused and concise summary.

> **Why It Matters**

- Verbose completions may exceed token budgets, especially in stateless or high-throughput scenarios.
- Truncation may occur mid-sentence or mid-thought, degrading output quality and user trust.
- In Azure OpenAI, token usage directly affects:
  - **Latency**: More tokens = longer processing time.
  - **Cost**: You are billed per token used.
  - **Reliability**: Long outputs are more likely to hit model limits or time out in high-load environments.
- Inconsistent output behavior can also affect downstream systems that rely on structured or predictable responses (e.g., JSON, YAML, or tabular formats).

> **How to Address**

- In Azure OpenAI Studio or API, configure your deployment or request with the following settings (a Python sketch applying them follows this list):
```json
{
  "temperature": 0.3,
  "top_p": 0.95,
  "max_tokens": 500,
  "stop": ["\n\n", "###", "END"]
}
```

- `temperature`: Controls randomness. Lower = more deterministic.
- `top_p`: Limits token sampling to the top probability mass. Helps reduce tangents.
- `max_tokens`: Caps output length to avoid overflow.
- `stop`: Ensures clean termination of output at logical boundaries.

- For stateless, high-throughput scenarios:
  - Use **Provisioned Throughput Units (PTUs)** in Azure OpenAI to guarantee consistent performance under load.
  - PTUs allow you to reserve capacity and avoid throttling during peak usage.
  - Ideal for APIs that serve thousands of requests per minute without chat history.

- Monitor and tune:
  - Use **Azure Monitor** and **Application Insights** to track:
    - `prompt_tokens`, `completion_tokens`, and `total_tokens`
    - Latency and response time
    - Truncation errors or incomplete outputs
  - Adjust `temperature`, `max_tokens`, and `stop` sequences based on observed behavior.

- For structured tasks (e.g., summarization, classification, extraction):
  - Keep `temperature` between 0.2–0.4
  - Use `frequency_penalty` and `presence_penalty` to reduce repetition or encourage novelty when needed:

```json
{
  "frequency_penalty": 0.2,
  "presence_penalty": 0.0
}
```
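
As referenced above, here is a hedged sketch of the same settings applied through the `openai` Python SDK's `AzureOpenAI` client. The endpoint, key, API version, deployment name, and prompt are placeholders; the parameter values mirror the JSON examples above.

```python
import os
from openai import AzureOpenAI

# Placeholder configuration; supply your own endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-4",                # your Azure *deployment* name (placeholder)
    messages=[{"role": "user", "content": "Summarize the following policy document: ..."}],
    temperature=0.3,              # low temperature -> concise, deterministic output
    top_p=0.95,
    max_tokens=500,               # hard cap to avoid overflow
    stop=["\n\n", "###", "END"],  # terminate cleanly at logical boundaries
    frequency_penalty=0.2,        # optional: discourage repetition in structured tasks
)

print(response.choices[0].message.content)
print(response.usage.total_tokens)  # prompt + completion tokens, useful for monitoring
```

The `usage` field returned on every response is also what you would log to Azure Monitor or Application Insights for the tracking described below.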
</details>

<details>
<summary><b> Monitoring & Scaling </b> (Click to expand)</summary>

> Azure OpenAI workloads, especially high-throughput or production-grade LLM applications, require robust monitoring and scaling strategies. Azure provides native tools like `Azure Monitor`, `Log Analytics`, and `Provisioned Throughput Units (PTUs)` to help you track performance, control costs, and ensure reliability at scale.

> E.g: A stateless summarization API that receives thousands of requests per minute may experience latency spikes or throttling if not backed by PTUs and monitored for token usage and response time. Without observability, it’s difficult to detect when truncation or cost overruns occur.

> **Why It Matters**

- Azure OpenAI usage is billed per token. Without monitoring, you may:
  - Exceed budget thresholds
  - Miss early signs of inefficient prompts or verbose completions

> **How to Address**

- Set up alerts for anomalies (e.g., token spikes, high latency)

- Use **Log Analytics** to (see the query sketch after this list):
  - Query historical usage patterns
  - Correlate token usage with specific endpoints or flows
  - Analyze prompt efficiency and model behavior over time

- Visualize trends in **Power BI**:
  - Connect to Log Analytics workspace
  - Build dashboards for:
    - Token consumption by endpoint
    - Cost per request
    - Truncation frequency

- Use **Provisioned Throughput Units (PTUs)** for predictable performance:
  - PTUs reserve dedicated capacity for Azure OpenAI deployments
  - Ideal for:
    - Stateless APIs
    - Batch processing
    - Real-time inference at scale
  - PTUs help avoid throttling and ensure consistent latency
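
As a starting point for the Log Analytics queries mentioned above, here is a minimal sketch using the `azure-monitor-query` SDK. The workspace ID is a placeholder, and the KQL table and column names are assumptions that depend on which diagnostic settings you enable, so adapt the query to what your workspace actually receives.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Assumed KQL: the AzureMetrics table and MetricName filter vary with your
# diagnostic settings; adjust them to your environment.
query = """
AzureMetrics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where MetricName contains "Token"
| summarize TotalTokens = sum(Total) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""

response = client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=7),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```

The same query can back a Power BI dashboard or an alert rule on token-usage thresholds.
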
> PTUs are especially useful when using Azure OpenAI in conjunction with Azure AI Search or Power Automate, where downstream systems depend on consistent response times.

</details>