Commit b3ee9af: more context about how tokenization works
1 parent d7d22e5

1 file changed: 0_Azure/3_AzureAI/AIFoundry/demos/4_TruncationHandling.md (116 additions & 15 deletions)
@@ -41,61 +41,129 @@ Last updated: 2025-03-03
| Tokenizer Behavior | Azure OpenAI uses subword tokenization (e.g., `tiktoken`) | Complex or rare words may consume more tokens than expected |
| Verbose or Tangential Output | High temperature settings cause longer, less focused completions | May exceed token limits and truncate output mid-thought |

<details>
<summary><b> Structural Complexity </b> (Click to expand)</summary>

> Documents with **conditional logic**, **nested clauses**, or **sparse named entities** are structurally complex. These patterns confuse tokenizers because they lack clear semantic anchors (like names or dates) and often involve long, interdependent clauses.
> E.g., `If the system fails to initialize, and the fallback protocol is not triggered unless the override is active, then the watchdog timer must be reset manually.`
> This sentence, while not long, contains multiple conditions and dependencies. Tokenizers break it into many subword units, inflating the token count.

> **Why It Matters**

- You may hit token limits even with seemingly short documents.
- Truncation may occur mid-sentence or mid-logic, leading to incomplete or incoherent outputs.
- Azure OpenAI’s tokenizer (`tiktoken`) breaks text into subword units, so structurally dense content can consume more tokens than expected.
- Complex documents often lack named entities (e.g., people, places, dates), which are helpful for grounding and compressing meaning efficiently.

> **How to Address**

- Use **semantic chunking** to isolate logical units (e.g., one condition per chunk). In Azure, this can be implemented using:
  - **Azure AI Search’s Document Layout skill** to chunk by paragraphs, headings, or sections.
  - **Text Split skill** to define chunk size and overlap, preserving context across boundaries.
  - Example configuration:

    ```json
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode": "pages",
      "maximumPageLength": 800,
      "pageOverlapLength": 100
    }
    ```

- Preprocess documents to simplify or flatten nested logic where possible:
  - Use Azure Functions or Logic Apps to transform complex conditionals into simpler declarative statements or bullet points.
  - Example transformation:
    - Original: `If A and B, unless C, then D.`
    - Flattened:
      - Condition 1: A is true
      - Condition 2: B is true
      - Exception: C is false
      - Action: Perform D

- Use **token-aware chunking** before sending content to Azure OpenAI (see the sketch after this list):
  - Deploy a preprocessing step using `tiktoken` in an Azure Function to:
    - Count tokens per clause or paragraph
    - Split content into ≤3000-token chunks
    - Return token-safe chunks to Azure OpenAI for inference
  - This ensures that each chunk respects token limits and avoids mid-logic truncation.

- Monitor token usage and truncation patterns using **Azure Monitor** and **Log Analytics**:
  - Track metrics like `prompt_tokens`, `completion_tokens`, and `total_tokens`.
  - Set alerts for high token usage or frequent truncation errors.
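
A minimal sketch of the token-aware chunking step above, using `tiktoken` to split text into chunks that stay under a token budget. The 3,000-token budget, the paragraph-based splitting, and the `chunk_by_tokens` name are illustrative assumptions:

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-4") -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        n = len(enc.encode(paragraph))
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    # Note: a single paragraph longer than max_tokens would still need
    # sentence- or clause-level splitting.
    return chunks
```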

</details>

<details>
<summary><b> Tokenizer Behavior </b> (Click to expand)</summary>

> Azure OpenAI uses the same tokenizer as OpenAI, typically `tiktoken`. This tokenizer breaks text into **subword tokens**, not full words. For example:
> - “Initialization” → `["Initial", "ization"]`
> - “FallbackProtocol” → `["Fallback", "Protocol"]`

> Complex syntax, rare words, or compound identifiers (like in code, legal, or scientific text) often result in more tokens per word than expected. This is especially common in enterprise documents with domain-specific terminology, acronyms, or camelCase identifiers.

> **Why It Matters**

- Token count can balloon unexpectedly, even in short or medium-length documents.
- This can lead to:
  - Premature truncation of outputs.
  - Rejection of prompts that exceed model limits (e.g., 128k tokens for GPT-4 Turbo).
  - Increased latency and cost due to inefficient token usage.
- Token inflation is especially problematic in Azure OpenAI when using models in high-throughput or stateless scenarios, where every token counts toward performance and billing.

> **How to Address**

- Use the `tiktoken` library to **pre-calculate token usage** before sending prompts to Azure OpenAI:
  - Deploy this as part of a preprocessing pipeline in an **Azure Function** or **Logic App**.
  - Example:

    ```python
    import tiktoken

    # Look up the encoding used by the target model.
    enc = tiktoken.encoding_for_model("gpt-4")

    # Encode the prompt and count the resulting tokens.
    tokens = enc.encode("Your input text here")
    print(len(tokens))
    ```

- Normalize or simplify text during preprocessing (see the identifier-normalization sketch after this list):
  - Replace compound identifiers like `FallbackProtocol` with `fallback protocol`.
  - Convert camelCase or snake_case to plain language equivalents.
  - Remove unnecessary jargon or abbreviations unless essential.

- Avoid overly technical phrasing unless required:
  - Instead of: `The system's failoverInitFlag must be set to true unless the watchdogOverride is active.`
  - Use: `The system must fail over unless the watchdog override is active.`

- Use **Azure AI Search** to preprocess and chunk documents before embedding:
  - The **Text Split skill** can help break down dense content into manageable, semantically meaningful units.
  - Combine this with token-aware chunking to ensure each chunk stays within safe token limits.

- Monitor token usage in production:
  - Use **Azure Monitor** and **Log Analytics** to track `prompt_tokens`, `completion_tokens`, and `total_tokens`.
  - Set alerts for unusually high token usage or truncation errors.
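
A minimal sketch of the identifier normalization described above; the `normalize_identifiers` helper and its regular expressions are illustrative assumptions and would need tuning for a real corpus:

```python
import re

def normalize_identifiers(text: str) -> str:
    """Rewrite snake_case and camelCase identifiers as plain words."""
    # snake_case -> spaces: "failover_init_flag" -> "failover init flag"
    text = re.sub(r"(?<=\w)_(?=\w)", " ", text)
    # camelCase/PascalCase boundaries: "FallbackProtocol" -> "Fallback Protocol"
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return text

print(normalize_identifiers("The FallbackProtocol checks failover_init_flag."))
# -> "The Fallback Protocol checks failover init flag."
```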

</details>

<details>
<summary><b> Verbose or Tangential Output </b> (Click to expand)</summary>

> The `temperature` parameter controls randomness in model output:
> - **High temperature (0.8–1.0)** → creative, verbose, tangential
> - **Low temperature (0.2–0.4)** → focused, deterministic, concise

> High temperature can cause the model to “ramble”, using more tokens than necessary and increasing the risk of hitting token limits.

> **Why It Matters**

- Verbose completions may exceed token budgets, especially in stateless or high-throughput scenarios.
- Truncation may occur mid-sentence or mid-thought, degrading output quality.
- In Azure OpenAI, token usage directly impacts:
  - **Latency**: More tokens = longer processing time.
  - **Cost**: You are billed per token used.
  - **Reliability**: Long outputs are more likely to hit model limits or time out in high-load environments.

> **How to Address**

@@ -106,11 +174,44 @@ Last updated: 2025-03-03
    "top_p": 0.9
  }
  ```

  - This configuration ensures the model stays focused and avoids unnecessary elaboration.
  - `top_p` (nucleus sampling) helps limit the range of token choices, further reducing verbosity.

- Use `max_tokens` to cap output length:

  ```json
  {
    "max_tokens": 1500
  }
  ```

  - This prevents the model from generating excessively long responses.
  - In Azure OpenAI Studio, you can set this in the deployment playground or via API.

- Define `stop` sequences to cut off output at logical boundaries:

  ```json
  {
    "stop": ["\n\n", "###", "END"]
  }
  ```

  - This is especially useful when generating structured outputs like JSON, YAML, or bullet lists.
  - It ensures the model stops cleanly instead of trailing off or repeating.

- In Azure OpenAI deployments:
  - Use **deployment-level defaults** for temperature and `max_tokens` to enforce consistency across applications.
  - For example, in Azure OpenAI Studio, under your deployment settings, configure:
    - `temperature = 0.3`
    - `max_tokens = 1024`
    - `frequency_penalty = 0.2` (optional, to reduce repetition)

- Monitor and tune:
  - Use **Azure Monitor** and **Application Insights** to track:
    - Average token usage per request
    - Frequency of truncation errors
    - Latency spikes due to verbose completions
  - Adjust temperature and `max_tokens` dynamically based on usage patterns (a combined request sketch follows this list).
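
As a sketch, the parameters above can be combined in a single request through the Azure OpenAI Python SDK; the endpoint, API version, and `my-gpt4-deployment` name are placeholder assumptions:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed GA API version; match your deployment
)

response = client.chat.completions.create(
    model="my-gpt4-deployment",        # placeholder deployment name
    messages=[{"role": "user", "content": "Summarize the incident report."}],
    temperature=0.3,                   # low temperature for focused output
    top_p=0.9,                         # nucleus sampling to curb verbosity
    max_tokens=1024,                   # hard cap on completion length
    stop=["\n\n", "###"],              # stop at logical boundaries
)

# Token accounting feeds the monitoring described above.
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```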

</details>

## How to resolve truncation issues

| **Solution Area** | **Action** | **Why It Helps** |