Commit 34b5444

format + monitoring
1 parent b3ee9af commit 34b5444

1 file changed: 0_Azure/3_AzureAI/AIFoundry/demos/4_TruncationHandling.md

Lines changed: 199 additions & 24 deletions

@@ -211,7 +211,6 @@ Last updated: 2025-03-03

</details>

## How to resolve truncation issues

| **Solution Area** | **Action** | **Why It Helps** |
@@ -222,43 +221,110 @@ Last updated: 2025-03-03
| Output Constraints | Use `max_tokens`, `stop` sequences, and `top_p` in Azure OpenAI API calls | Ensures clean, bounded outputs and avoids mid-sentence truncation |
| Monitoring & Scaling | Use Azure Monitor, Log Analytics, and PTUs for throughput and cost control | Enables observability and resilience at enterprise scale |

<details>
<summary><b> Token Budgeting in Azure </b> (Click to expand)</summary>

> Azure OpenAI models like GPT-4-128k enforce strict token limits. Complex documents with nested logic or rare terms can tokenize inefficiently, leading to unexpected truncation. Use an `Azure Function` or `Logic App` with the `tiktoken` library to analyze and split documents into token-aware chunks before sending them to Azure OpenAI.

> E.g., a 5-page technical document with nested conditions and domain-specific terms may tokenize into 40,000+ tokens, well beyond expectations, because of how `tiktoken` breaks down compound words and logic-heavy phrasing.

> **Why It Matters**

- Token limits are hard constraints in Azure OpenAI. If your prompt + completion exceeds the model’s token ceiling (e.g., 128k for GPT-4-128k), the request will fail or be truncated.
- Token inflation from complex syntax can cause:
  - Incomplete outputs
  - Higher latency
  - Increased cost per request
- Without token budgeting, you risk unpredictable behavior in production, especially in stateless or high-throughput applications.

> **How to Address**

- Use `tiktoken` in an Azure-native preprocessing pipeline to **count and manage tokens**:
  - Deploy an **Azure Function** that:
    - Accepts raw document input (from Blob Storage, SharePoint, etc.)
    - Uses `tiktoken` to calculate token count
    - Splits content into ≤3000-token chunks (or your preferred threshold)
    - Returns token-safe chunks to Azure OpenAI or Power Automate

Example Python snippet:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
your_text = "Replace with the document content you want to budget."  # placeholder input
tokens = enc.encode(your_text)
print(len(tokens))  # Use this count to enforce chunking logic
```
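
To illustrate the chunking step such an Azure Function would perform, here is a minimal sketch that greedily splits text into ≤3000-token windows with `tiktoken`; the `max_tokens` threshold and the helper name `split_into_token_chunks` are assumptions for illustration, not a prescribed implementation:

```python
import tiktoken

def split_into_token_chunks(text: str, max_tokens: int = 3000, model: str = "gpt-4") -> list[str]:
    """Greedily split text into chunks of at most max_tokens tokens (illustrative helper)."""
    enc = tiktoken.encoding_for_model(model)
    token_ids = enc.encode(text)
    chunks = []
    for start in range(0, len(token_ids), max_tokens):
        # Decode each token window back to text; boundaries may fall mid-sentence,
        # which is why the semantic chunking described later is often preferable.
        chunks.append(enc.decode(token_ids[start:start + max_tokens]))
    return chunks

# Example usage with a placeholder document
chunks = split_into_token_chunks("Your long document text goes here...")
print(f"{len(chunks)} chunk(s) produced")
```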

- Integrate with **Power Automate**:
  - Trigger the Azure Function from a flow
  - Loop through returned chunks and call Azure OpenAI for each
  - Aggregate responses if needed (e.g., for summarization or document Q&A); a sketch of this per-chunk loop follows below
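
In a real flow these steps are Power Automate actions; the Python sketch below is only an equivalent illustration of the loop-and-aggregate pattern, and the endpoint, API version, deployment name, and summarization prompt are placeholders:

```python
import os
from openai import AzureOpenAI  # assumes the openai Python package with Azure support

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder environment variables
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Placeholder chunks, e.g., the output of the token-aware splitter sketched earlier
chunks = ["First token-safe chunk...", "Second token-safe chunk..."]

partial_summaries = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4",  # your Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": "Summarize the provided text concisely."},
            {"role": "user", "content": chunk},
        ],
        max_tokens=300,
    )
    partial_summaries.append(response.choices[0].message.content)

# Aggregate per-chunk results (e.g., feed the combined partial summaries into a final pass)
final_summary = "\n".join(partial_summaries)
print(final_summary)
```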

- Use **Azure AI Search** for chunking + embedding:
  - Apply the **Text Split skill** with `maximumPageLength` and `overlappingLength` to control chunk size and preserve context.
  - Combine with the **Document Layout skill** to chunk by paragraphs or sections.

Example skill configuration:
```json
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "textSplitMode": "pages",
  "maximumPageLength": 800,
  "overlappingLength": 100
}
```

> **Monitoring**

- Use **Azure Monitor** and **Log Analytics** to track:
  - `tokens_used` per request
  - `flowRunId` and `request_uri` for traceability
  - `completion_tokens` vs `prompt_tokens` to optimize prompt design (see the sketch after this list)

- Visualize token trends in **Power BI**:
  - Identify documents or use cases with high token consumption
  - Detect anomalies like sudden spikes or frequent truncation errors

- Set alerts for:
  - Token usage thresholds
  - Truncation-related errors
  - Latency spikes due to oversized prompts
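
As a concrete starting point for these metrics, the hedged sketch below pulls `prompt_tokens` and `completion_tokens` from an Azure OpenAI SDK response and emits them as a structured log record; the field names (including `flowRunId` and `request_uri`) mirror the dimensions mentioned above and are assumptions about your logging schema, not a fixed contract:

```python
import json
import logging

logger = logging.getLogger("token_budgeting")
logging.basicConfig(level=logging.INFO)

def log_token_usage(response, flow_run_id: str, request_uri: str) -> None:
    """Emit token usage as structured JSON that Application Insights / Log Analytics can ingest."""
    usage = response.usage  # openai SDK responses expose prompt/completion/total token counts
    record = {
        "flowRunId": flow_run_id,       # illustrative correlation id from Power Automate
        "request_uri": request_uri,     # illustrative endpoint identifier
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "tokens_used": usage.total_tokens,
    }
    logger.info(json.dumps(record))
```

Routing these records to Application Insights or a Log Analytics workspace (for example via the Azure Monitor OpenTelemetry distro) makes them available to the Power BI dashboards and alerts described above.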

</details>

<details>
<summary><b> Semantic Chunking with Azure AI Search </b> (Click to expand)</summary>

> Azure `AI Search` supports semantic chunking via built-in skills like `Document Layout` and `Text Split`. These tools preserve logical structure and improve retrieval quality for RAG pipelines. `Chunking is not just about staying under token limits; it also improves embedding quality and relevance scoring.`
> Click here to read more about [Chunk large documents for vector search solutions in Azure AI Search](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents)

> E.g., a 10-page policy document with multiple sections and subheadings can be chunked by paragraph or heading using the Document Layout skill. This ensures each chunk contains a coherent idea, improving both retrieval accuracy and LLM comprehension.

> **Why It Matters**

- Fixed-size chunking (e.g., every 1,024 tokens) can split content mid-sentence or mid-thought, leading to:
  - Poor embedding quality
  - Irrelevant or incoherent retrieval results
  - Increased token usage due to repeated context
- Semantic chunking ensures each chunk is logically complete, which:
  - Improves the quality of vector embeddings
  - Increases the relevance of search results in RAG pipelines
  - Reduces the likelihood of truncation or hallucination in LLM responses

> **How to Apply in Azure**

- Use the **Document Layout skill** to chunk by:
  - Paragraphs
  - Headings (e.g., Markdown, HTML, or PDF structure)
  - Tables or sections
  - This skill extracts structural elements from documents and enables chunking based on layout rather than raw text length.

- Use the **Text Split skill** to:
  - Split by sentence, character count, or page length
  - Add 10–15% overlap between chunks to preserve context across boundaries
  - This is especially useful when layout metadata is unavailable or inconsistent

> **Example Configuration**

```json
{
@@ -273,15 +339,42 @@ Last updated: 2025-03-03
}
```

- Combine both skills in a skillset pipeline:
  - First, apply the Document Layout skill to extract structured content
  - Then, apply the Text Split skill to chunk the structured content into token-efficient, semantically meaningful segments (a simple heading-based approximation is sketched below)
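
Outside of Azure AI Search, the same idea can be prototyped locally; the sketch below is a rough, assumed approximation that splits a Markdown document on headings (standing in for the Document Layout skill) and then applies a token cap per section (standing in for the Text Split skill):

```python
import re
import tiktoken

def chunk_by_headings(markdown_text: str, max_tokens: int = 800) -> list[str]:
    """Split on Markdown headings, then cap each section at max_tokens (illustrative only)."""
    enc = tiktoken.encoding_for_model("gpt-4")
    # Split in front of every heading line ("#", "##", ...), keeping the heading with its section.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown_text)
    chunks = []
    for section in filter(None, (s.strip() for s in sections)):
        token_ids = enc.encode(section)
        for start in range(0, len(token_ids), max_tokens):
            chunks.append(enc.decode(token_ids[start:start + max_tokens]))
    return chunks

# Example usage with a placeholder document
print(len(chunk_by_headings("# Policy\nIntro text...\n\n## Scope\nDetails...")))
```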

> **Additional Tips**

- Store both the chunk content and its source metadata (e.g., heading, section title) in your Azure AI Search index. This improves filtering and reranking.
- Use chunk-level embeddings for vector search and rerank results using Azure AI Search’s semantic scoring.
- Evaluate chunking effectiveness by measuring:
  - Retrieval precision (e.g., top-k relevance)
  - LLM response quality
  - Token efficiency per query

</details>

<details>
<summary><b> Temperature & Output Control in Azure OpenAI </b> (Click to expand)</summary>

> High temperature values (e.g., 0.8–1.0) increase creativity but also verbosity, which can lead to token overflow. Lower values (e.g., 0.2–0.4) yield more concise, deterministic outputs. Combine temperature control with `top_p`, `stop` sequences, and `max_tokens` in your Azure OpenAI deployment or API call. Click here to read more about [What is Azure OpenAI in Azure AI Foundry Models?](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview)

> E.g., a prompt like “Summarize the following policy document” with a high temperature (0.9) might result in a long, overly creative response that includes speculative or unnecessary elaboration. The same prompt with a temperature of 0.2 will yield a more focused and concise summary.

> **Why It Matters**

- Verbose completions may exceed token budgets, especially in stateless or high-throughput scenarios.
- Truncation may occur mid-sentence or mid-thought, degrading output quality and user trust.
- In Azure OpenAI, token usage directly affects:
  - **Latency**: More tokens = longer processing time.
  - **Cost**: You are billed per token used.
  - **Reliability**: Long outputs are more likely to hit model limits or time out in high-load environments.
- Inconsistent output behavior can also affect downstream systems that rely on structured or predictable responses (e.g., JSON, YAML, or tabular formats).

> **How to Address**

- In Azure OpenAI Studio or API, configure your deployment or request with:
```json
{
  "temperature": 0.3,
@@ -290,15 +383,97 @@ Last updated: 2025-03-03
  "stop": ["\n\n", "###", "END"]
}
```
- `temperature`: Controls randomness. Lower = more deterministic.
- `top_p`: Limits token sampling to the top probability mass. Helps reduce tangents.
- `max_tokens`: Caps output length to avoid overflow.
- `stop`: Ensures clean termination of output at logical boundaries.
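
The same settings can be passed from code; this is a minimal sketch using the `openai` Python SDK's Azure client, where the endpoint, key, API version, deployment name, and prompt are placeholders for your own deployment:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",                    # your Azure OpenAI deployment name
    messages=[{"role": "user", "content": "Summarize the following policy document: ..."}],
    temperature=0.3,                  # lower = more deterministic, less verbose
    top_p=0.9,                        # restrict sampling to the top probability mass
    max_tokens=500,                   # cap output length to avoid overflow
    stop=["\n\n", "###", "END"],      # terminate cleanly at logical boundaries
)
print(response.choices[0].message.content)
```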

- For stateless, high-throughput scenarios:
  - Use **Provisioned Throughput Units (PTUs)** in Azure OpenAI to guarantee consistent performance under load.
  - PTUs allow you to reserve capacity and avoid throttling during peak usage.
  - Ideal for APIs that serve thousands of requests per minute without chat history.

- Monitor and tune:
  - Use **Azure Monitor** and **Application Insights** to track:
    - `prompt_tokens`, `completion_tokens`, and `total_tokens`
    - Latency and response time
    - Truncation errors or incomplete outputs
  - Adjust `temperature`, `max_tokens`, and `stop` sequences based on observed behavior.

- For structured tasks (e.g., summarization, classification, extraction):
  - Keep `temperature` between 0.2–0.4
  - Use `frequency_penalty` and `presence_penalty` to reduce repetition or encourage novelty when needed:

```json
{
  "frequency_penalty": 0.2,
  "presence_penalty": 0.0
}
```

</details>

<details>
<summary><b> Monitoring & Scaling </b> (Click to expand)</summary>

> Azure OpenAI workloads, especially high-throughput or production-grade LLM applications, require robust monitoring and scaling strategies. Azure provides native tools like `Azure Monitor`, `Log Analytics`, and `Provisioned Throughput Units (PTUs)` to help you track performance, control costs, and ensure reliability at scale.

> E.g., a stateless summarization API that receives thousands of requests per minute may experience latency spikes or throttling if not backed by PTUs and monitored for token usage and response time. Without observability, it’s difficult to detect when truncation or cost overruns occur.

> **Why It Matters**

- Azure OpenAI usage is billed per token. Without monitoring, you may:
  - Exceed budget thresholds
  - Miss early signs of inefficient prompts or verbose completions
- High-traffic applications can experience:
  - Throttling or rate-limiting
  - Latency spikes
  - Inconsistent performance without PTUs
- Lack of observability can delay detection of:
  - Truncation errors
  - Prompt design flaws
  - Token inflation due to structural complexity

> **How to Address**

- Use **Azure Monitor** to track key metrics:
  - `tokens_used`, `prompt_tokens`, `completion_tokens`
  - `latency`, `success_rate`, `throttled_requests`
  - Set up alerts for anomalies (e.g., token spikes, high latency)

- Use **Log Analytics** to:
  - Query historical usage patterns
  - Correlate token usage with specific endpoints or flows
  - Analyze prompt efficiency and model behavior over time (a query sketch follows this list)
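
As one possible starting point, the sketch below uses the `azure-monitor-query` SDK to run a Kusto query against a Log Analytics workspace; it assumes Azure OpenAI diagnostic logs are already routed to that workspace, and the table and column names in the query are illustrative placeholders that depend on your diagnostic settings:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Illustrative KQL: adjust the table and column names to match your diagnostic settings.
query = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize requests = count() by bin(TimeGenerated, 1h), OperationName
| order by TimeGenerated desc
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)
for table in result.tables:
    for row in table.rows:
        print(row)
```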

- Visualize trends in **Power BI**:
  - Connect to Log Analytics workspace
  - Build dashboards for:
    - Token consumption by endpoint
    - Cost per request
    - Truncation frequency

- Use **Provisioned Throughput Units (PTUs)** for predictable performance:
  - PTUs reserve dedicated capacity for Azure OpenAI deployments
  - Ideal for:
    - Stateless APIs
    - Batch processing
    - Real-time inference at scale
  - PTUs help avoid throttling and ensure consistent latency

> PTUs are especially useful when using Azure OpenAI in conjunction with Azure AI Search or Power Automate, where downstream systems depend on consistent response times.

- Monitor PTU usage:
  - Use Azure Metrics Explorer to track:
    - PTU consumption
    - Cache hit ratio
    - Request volume
  - Adjust PTU allocation based on observed demand (a metrics query sketch follows below)
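
The same metrics can be pulled programmatically; this sketch uses the `azure-monitor-query` `MetricsQueryClient` against an Azure OpenAI resource, where the resource ID is a placeholder and the metric name is an assumption; check Metrics Explorer for the exact metric names exposed by your deployment:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

# Placeholder resource ID of the Azure OpenAI account.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<openai-account>"
)

response = client.query_resource(
    resource_id,
    metric_names=["ProvisionedManagedUtilizationV2"],  # assumed metric name, verify in the portal
    timespan=timedelta(hours=6),
    granularity=timedelta(minutes=15),
)
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)
```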

</details>


<div align="center">
  <h3 style="color: #4CAF50;">Total Visitors</h3>
  <img src="https://profile-counter.glitch.me/brown9804/count.svg" alt="Visitor Count" style="border: 2px solid #4CAF50; border-radius: 5px; padding: 5px;"/>
