# Truncation Handling for Complex Documents - Overview

Costa Rica

[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/)

[brown9804](https://github.com/brown9804)

Last updated: 2025-03-03

------------------------------------------

> Truncation often results from exceeding token limits or poor chunking strategies. Complex documents (those with conditional logic, sparse entities, or nested structures) can tokenize inefficiently.

<details>
<summary><b>List of References</b> (Click to expand)</summary>

</details>

<details>
<summary><b>Table of Contents</b> (Click to expand)</summary>

</details>

## Overview

> Why does truncation happen, even in shorter, complex documents? <br/>
> `Truncation happens when the total token count of your prompt plus its output exceeds the model's limit. But here's the catch: it's not just how long your prompt is; how complex it is can also bump up the token usage.`

| Cause | Description | Why It Matters |
|-----------------------------|-----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| Structural Complexity | Conditional logic, nested clauses, or sparse named entities lead to inefficient tokenization | Increases token count unexpectedly, risking mid-sentence truncation |
| Tokenizer Behavior | Azure OpenAI uses subword tokenization (e.g., `tiktoken`) | Complex or rare words may consume more tokens than expected |
| Verbose or Tangential Output| High temperature settings cause longer, less focused completions | May exceed token limits and truncate output mid-thought |
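
The budget arithmetic behind truncation can be sketched as a quick pre-flight check. This is a minimal illustration; the 8,192-token context window is an assumed example value, so check your actual deployment's limit:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 8192) -> bool:
    """Return True if prompt plus requested output fits the model's window.

    context_window is an assumed example value; substitute your
    deployment's actual limit.
    """
    return prompt_tokens + max_output_tokens <= context_window

# A 7,000-token prompt requesting 2,000 output tokens overflows an 8k window.
print(fits_context(7000, 2000))  # False: truncation risk
print(fits_context(3000, 2000))  # True
```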

<details>
<summary><b> Structural Complexity </b> (Click to expand)</summary>

> Documents with **conditional logic**, **nested clauses**, or **sparse named entities** are structurally complex. These patterns confuse tokenizers because they lack clear semantic anchors (like names or dates) and often involve long, interdependent clauses.

> For example: `If the system fails to initialize, and the fallback protocol is not triggered unless the override is active, then the watchdog timer must be reset manually.`
> This sentence, while not long, contains multiple conditions and dependencies. Tokenizers break it into many subword units, inflating the token count.

> **Why It Matters**

- You may hit token limits even with seemingly short documents.
- Truncation may occur mid-sentence or mid-logic, leading to incomplete or incoherent outputs.

> **How to Address**

- Use **semantic chunking** to isolate logical units (e.g., one condition per chunk).
- Preprocess documents to simplify or flatten nested logic where possible.

</details>
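
A minimal sketch of clause-level chunking: split a sentence at conditional connectives so each chunk carries one logical unit. The connective list here is a simplified assumption; production semantic chunking would typically use a parser or embedding similarity instead:

```python
import re

# Split at a comma followed by a conditional connective (and/unless/then/but).
# This connective list is a simplified assumption for illustration.
CONNECTIVES = r",\s*(?=and\b|unless\b|then\b|but\b)"

def chunk_clauses(sentence: str) -> list[str]:
    """Split a sentence into clause-sized chunks at conditional connectives."""
    return [c.strip() for c in re.split(CONNECTIVES, sentence) if c.strip()]

text = ("If the system fails to initialize, and the fallback protocol is not "
        "triggered unless the override is active, then the watchdog timer "
        "must be reset manually.")
for clause in chunk_clauses(text):
    print(clause)
```

Each resulting chunk holds a single condition, so no chunk boundary falls mid-logic when chunks are sent separately.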
58+
59+
<details>
60+
<summary><b> Tokenizer Behavior </b> (Click to expand)</summary>
61+
62+
> Azure OpenAI uses the same tokenizer as OpenAI, typically `tiktoken`. This tokenizer breaks text into **subword tokens**, not full words. For example:
63+
> - “Initialization” → `["Initial", "ization"]`
64+
> - “FallbackProtocol” → `["Fallback", "Protocol"]`
65+
66+
> Complex syntax, rare words, or compound identifiers (like in code or legal text) often result in more tokens per word than expected.
67+
68+
> **Why It Matters**
69+
70+
- Token count can balloon unexpectedly, even in short or medium-length documents.
71+
- This can lead to premature truncation or rejection of prompts that exceed model limits.
72+
73+
> **How to Address**
74+
75+
- Use the `tiktoken` library to **pre-calculate token usage** before sending prompts.
76+
- Normalize or simplify text during preprocessing (e.g., split compound words).
77+
- Avoid overly technical phrasing unless necessary.
78+
79+
</details>
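
Pre-calculating token usage is a short call with `tiktoken`. The sketch below assumes a `cl100k_base`-family model and falls back to a rough characters-per-token heuristic if the library is not installed:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken; fall back to a rough heuristic."""
    try:
        import tiktoken  # pip install tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        # Rough approximation: ~4 characters per token for English text.
        return max(1, len(text) // 4)

prompt = "If the system fails to initialize, reset the watchdog timer."
print(count_tokens(prompt))
```

Running this on each chunk before sending lets you reject or re-chunk anything that would blow the budget, instead of discovering truncation in the response.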
80+
81+
<details>
82+
<summary><b> Verbose or Tangential Output </b> (Click to expand)</summary>
83+
84+
> The `temperature` parameter controls randomness in model output:
85+
> - **High temperature (0.8–1.0)** → creative, verbose, tangential
86+
> - **Low temperature (0.2–0.4)** → focused, deterministic, concise
87+
88+
> High temperature can cause the model to “ramble”, using more tokens than necessary and increasing the risk of hitting token limits.
89+
90+
> **Why It Matters**
91+
92+
- Verbose completions may exceed token budgets, especially in stateless or high-throughput scenarios.
93+
- Truncation may occur mid-sentence or mid-thought, degrading output quality.
94+
95+
> **How to Address**
96+
97+
- For structured tasks (e.g., summarization, extraction), set:
98+
```json
99+
{
100+
"temperature": 0.2,
101+
"top_p": 0.9
102+
}
103+
```
104+
- Use `max_tokens` to cap output length.
105+
- Define `stop` sequences to cut off output at logical boundaries (e.g., `["\n\n", "###"]`).
106+
107+
</details>
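
The three mitigations above can be bundled into one request-parameter helper. A minimal sketch; the helper name and default cap are illustrative, and the dict would be passed to your client call (e.g., `client.chat.completions.create(model="<your-deployment>", **params)` with the Azure OpenAI Python SDK):

```python
def structured_task_params(prompt: str, max_tokens: int = 512) -> dict:
    """Build completion parameters tuned for structured, truncation-safe tasks.

    Helper name and max_tokens default are illustrative assumptions.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,        # focused, deterministic output
        "top_p": 0.9,
        "max_tokens": max_tokens,  # hard cap on output length
        "stop": ["\n\n", "###"],   # cut off at logical boundaries
    }

params = structured_task_params("Summarize the incident report in 3 bullets.")
print(params["temperature"], params["max_tokens"])
```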

<div align="center">
  <h3 style="color: #4CAF50;">Total Visitors</h3>
  <img src="https://profile-counter.glitch.me/brown9804/count.svg" alt="Visitor Count" style="border: 2px solid #4CAF50; border-radius: 5px; padding: 5px;"/>
</div>
