DeepSeek-VL2, released in December 2024, is a Mixture-of-Experts (MoE) vision-language model that leads several OCR and document understanding benchmarks:
| Benchmark | DeepSeek-VL2 | GPT-4o | Improvement |
|---|---|---|---|
| OCRBench | 834 | 736 | +13% |
| DocVQA | 93.3% | 92.8% | +0.5% |
| TextVQA | 84.2% | 77.4% | +9% |
| ChartQA | 85.7% | 78.5% | +9% |
The model uses "context optical compression" - encoding entire pages as compact vision tokens instead of individual text tokens. This achieves 20x compression with ~97% accuracy retention.
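As a back-of-the-envelope illustration of that ratio (the token counts below are made-up examples, not measured values):

```python
# Hypothetical dense page: ~5,000 text tokens if transcribed token-by-token.
text_tokens = 5000

# At the claimed ~20x compression, the same page fits in far fewer vision tokens.
vision_tokens = text_tokens // 20
print(vision_tokens)  # 250
```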
| Variant | Total Params | Activated | Min GPU | Best For |
|---|---|---|---|---|
| Tiny | 3.37B | 1.0B | 16GB | Prototyping, edge |
| Small | 16.1B | 2.8B | 80GB | Production OCR |
| Full | 27.5B | 4.5B | 80GB+ | Maximum accuracy |
The MoE architecture means only ~1/6th of parameters activate per token, making inference efficient.
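The activation ratio is easy to check against the table above (Full variant):

```python
# Full variant: 27.5B total parameters, 4.5B activated per token
total_params = 27.5e9
active_params = 4.5e9

fraction = active_params / total_params
print(f"{fraction:.3f}")  # 0.164 -- roughly 1/6 of the parameters per token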
```shell
pip install torch transformers accelerate pillow
pip install "vllm>=0.7.2"      # for production throughput (quoted so the shell doesn't treat >= as a redirect)
pip install pymupdf pdf2image  # for PDF support
```

```python
from src.deepseek_vl2 import DeepSeekVL2

# Load model
model = DeepSeekVL2(variant="small").load()

# Extract text from image
response = model.chat(
    "Extract all text from this document",
    "invoice.png",
)
print(response)
```

```python
from src.deepseek_vl2 import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor(variant="small")

# Extract structured invoice data
invoice = extractor.extract("invoice.pdf", output_format="invoice")
print(invoice)
# {
#   "vendor_name": "Acme Corp",
#   "invoice_number": "INV-2024-001",
#   "total": 1250.00,
#   "line_items": [...]
# }

# Extract tables
tables = extractor.extract_tables("report.pdf")
for table in tables:
    print(table["headers"])
    for row in table["rows"]:
        print(row)

# Custom schema extraction
schema = {
    "company": "string",
    "date": "string",
    "items": [{"name": "string", "qty": "number", "price": "number"}],
}
data = extractor.extract("purchase_order.png", schema=schema)
```

```python
from src.deepseek_vl2 import DocumentExtractor

with DocumentExtractor() as extractor:
    # Process all pages
    results = extractor.extract("contract.pdf", output_format="json")
    # Results is a list, one dict per page
    for i, page_data in enumerate(results):
        print(f"Page {i+1}: {page_data}")

    # Find differences between two versions
    diff = extractor.compare_documents("contract_v1.pdf", "contract_v2.pdf")
    print(diff["summary"])
    print(diff["differences"])
```

For high-throughput production, use the vLLM backend:
```python
from src.deepseek_vl2 import DeepSeekVL2

# vLLM provides continuous batching and optimized CUDA kernels
model = DeepSeekVL2(
    variant="small",
    backend="vllm",  # production backend
).load()

# Process batch of documents
documents = ["doc1.png", "doc2.png", "doc3.png"]
for doc in documents:
    result = model.chat("Extract all text", doc)
```

For API serving:
```shell
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-vl2-small \
    --trust-remote-code \
    --dtype bfloat16

# Query via API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/deepseek-vl2-small",
        "messages": [{"role": "user", "content": "Extract text from <image>"}],
        "images": ["base64_encoded_image"]
    }'
```

DeepSeek-VL2 uses dynamic tiling (384x384 tiles). For best results:
- Keep images under 4096x4096 for speed
- Don't downscale below 768px for small text
- Higher DPI (150+) for PDFs with small fonts
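The first two bounds can be sketched in plain Python (the function name and logic are illustrative, not part of the library; apply the result with `PIL.Image.resize`):

```python
def clamp_size(width: int, height: int, max_side: int = 4096, min_side: int = 768) -> tuple:
    """Keep the longest side <= max_side for speed, but never shrink the
    shortest side below min_side, so small text stays legible for OCR."""
    longest, shortest = max(width, height), min(width, height)
    if longest <= max_side:
        return width, height  # already within bounds
    scale = max_side / longest
    if shortest * scale < min_side:
        # Legibility wins over speed: back off to the smallest safe scale
        scale = min(1.0, min_side / shortest)
    return round(width * scale), round(height * scale)

print(clamp_size(8000, 6000))  # (4096, 3072)
print(clamp_size(1200, 900))   # (1200, 900) -- untouched
```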
```python
from src.utils import load_pdf_pages

# Higher DPI for better OCR on small text
pages = load_pdf_pages("dense_document.pdf", dpi=200)
```

Be specific about what you want extracted:
```python
# Good: specific extraction request
prompt = """Extract the following from this invoice:
- Vendor name and address
- Invoice number and date
- Line items with quantities and prices
- Subtotal, tax, and total
Return as JSON."""

# Less good: vague request
prompt = "What's in this document?"
```

DeepSeek-VL2 disables dynamic tiling with >2 images. For multi-image workflows:
```python
# Process separately for best quality
for page in pdf_pages:
    result = extractor.extract(page)

# Or batch with lower resolution
results = model.chat("Compare these pages", pages[:4])
```

- Use `temperature=0.0` for extraction (deterministic)
- Use `temperature=0.3` for summarization (slight creativity)
```python
# Extraction: deterministic
data = extractor.extract(doc, output_format="invoice")  # uses temp=0.0

# Q&A: allow some flexibility
answer = model.chat(question, doc, temperature=0.3)
```

```python
# Use a smaller variant
model = DeepSeekVL2(variant="tiny")  # fits a 16GB GPU

# Or reduce image resolution
from src.utils import resize_for_model
img = resize_for_model(large_image, model="deepseek", max_tokens=2048)
```

```python
# Switch to the vLLM backend
model = DeepSeekVL2(variant="small", backend="vllm")

# Or use flash attention (default with transformers)
model = DeepSeekVL2(variant="small")  # uses flash_attention_2
```

The DocumentExtractor handles JSON parsing automatically, including extracting JSON from markdown code blocks. If you're using the model directly:
```python
import json
import re

response = model.chat(prompt, image)

# Extract JSON from a fenced code block if present, else parse directly
# (`{3}` matches the three backticks of a markdown fence)
match = re.search(r"`{3}(?:json)?\s*([\s\S]*?)`{3}", response)
if match:
    data = json.loads(match.group(1))
else:
    data = json.loads(response)
```

- Batch processing: process multiple single-page documents per model load
- PDF DPI: 150 DPI balances quality and speed for most documents
- Preprocessing: deskew and denoise images before OCR
- Caching: cache model weights with the `TRANSFORMERS_CACHE` env var
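For example, the cache location can be pinned before launching (the path and script name below are placeholders):

```shell
# Cache weights on a fast local disk so repeat runs skip the download
export TRANSFORMERS_CACHE=/mnt/ssd/hf-cache
python run_extraction.py
```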
See the source code documentation:

- `src/deepseek_vl2/loader.py` - model loading
- `src/deepseek_vl2/document_ocr.py` - document extraction