DeepSeek-VL2, released in December 2024, is a Mixture-of-Experts (MoE) vision-language model that leads several OCR and document understanding benchmarks:
| Benchmark | DeepSeek-VL2 | GPT-4o | Improvement |
|---|---|---|---|
| OCRBench | 834 | 736 | +13% |
| DocVQA | 93.3% | 92.8% | +0.5% |
| TextVQA | 84.2% | 77.4% | +9% |
| ChartQA | 85.7% | 78.5% | +9% |
The model uses "context optical compression" - encoding entire pages as compact vision tokens instead of individual text tokens. This achieves 20x compression with ~97% accuracy retention.
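As a back-of-the-envelope illustration of that ratio (the token counts below are made-up examples, not measured values):

```python
# Hypothetical dense page: ~5,000 text tokens if transcribed token-by-token.
text_tokens = 5000

# At the claimed ~20x compression, the same page fits in far fewer vision tokens.
vision_tokens = text_tokens // 20
print(vision_tokens)  # 250
```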
| Variant | Total Params | Activated | Min GPU | Best For |
|---|---|---|---|---|
| Tiny | 3.37B | 1.0B | 16GB | Prototyping, edge |
| Small | 16.1B | 2.8B | 80GB | Production OCR |
| Full | 27.5B | 4.5B | 80GB+ | Maximum accuracy |
The MoE architecture means only ~1/6th of parameters activate per token, making inference efficient.
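The activation ratio is easy to check against the table above (Full variant):

```python
# Full variant: 27.5B total parameters, 4.5B activated per token
total_params = 27.5e9
active_params = 4.5e9

fraction = active_params / total_params
print(f"{fraction:.3f}")  # 0.164 -- roughly 1/6 of the parameters per token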
```shell
pip install torch transformers accelerate pillow
pip install "vllm>=0.7.2"      # for production throughput (quoted so the shell doesn't treat >= as a redirect)
pip install pymupdf pdf2image  # for PDF support
```

```python
from src.deepseek_vl2 import DeepSeekVL2

# Load model
model = DeepSeekVL2(variant="small").load()

# Extract text from image
response = model.chat(
    "Extract all text from this document",
    "invoice.png",
)
print(response)
```

```python
from src.deepseek_vl2 import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor(variant="small")

# Extract structured invoice data
invoice = extractor.extract("invoice.pdf", output_format="invoice")
print(invoice)
# {
#   "vendor_name": "Acme Corp",
#   "invoice_number": "INV-2024-001",
#   "total": 1250.00,
#   "line_items": [...]
# }

# Extract tables
tables = extractor.extract_tables("report.pdf")
for table in tables:
    print(table["headers"])
    for row in table["rows"]:
        print(row)

# Custom schema extraction
schema = {
    "company": "string",
    "date": "string",
    "items": [{"name": "string", "qty": "number", "price": "number"}],
}
data = extractor.extract("purchase_order.png", schema=schema)
```

```python
from src.deepseek_vl2 import DocumentExtractor

with DocumentExtractor() as extractor:
    # Process all pages
    results = extractor.extract("contract.pdf", output_format="json")
    # Results is a list, one dict per page
    for i, page_data in enumerate(results):
        print(f"Page {i+1}: {page_data}")

    # Find differences between two versions
    diff = extractor.compare_documents("contract_v1.pdf", "contract_v2.pdf")
    print(diff["summary"])
    print(diff["differences"])
```

For high-throughput production, use the vLLM backend:
```python
from src.deepseek_vl2 import DeepSeekVL2

# vLLM provides continuous batching and optimized CUDA kernels
model = DeepSeekVL2(
    variant="small",
    backend="vllm",  # production backend
).load()

# Process batch of documents
documents = ["doc1.png", "doc2.png", "doc3.png"]
for doc in documents:
    result = model.chat("Extract all text", doc)
```

For API serving:
```shell
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-vl2-small \
    --trust-remote-code \
    --dtype bfloat16

# Query via API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/deepseek-vl2-small",
        "messages": [{"role": "user", "content": "Extract text from <image>"}],
        "images": ["base64_encoded_image"]
    }'
```

DeepSeek-VL2 uses dynamic tiling (384x384 tiles). For best results:
- Keep images under 4096x4096 for speed
- Don't downscale below 768px for small text
- Higher DPI (150+) for PDFs with small fonts
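The first two bounds can be sketched in plain Python (the function name and logic are illustrative, not part of the library; apply the result with `PIL.Image.resize`):

```python
def clamp_size(width: int, height: int, max_side: int = 4096, min_side: int = 768) -> tuple:
    """Keep the longest side <= max_side for speed, but never shrink the
    shortest side below min_side, so small text stays legible for OCR."""
    longest, shortest = max(width, height), min(width, height)
    if longest <= max_side:
        return width, height  # already within bounds
    scale = max_side / longest
    if shortest * scale < min_side:
        # Legibility wins over speed: back off to the smallest safe scale
        scale = min(1.0, min_side / shortest)
    return round(width * scale), round(height * scale)

print(clamp_size(8000, 6000))  # (4096, 3072)
print(clamp_size(1200, 900))   # (1200, 900) -- untouched
```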
```python
from src.utils import load_pdf_pages

# Higher DPI for better OCR on small text
pages = load_pdf_pages("dense_document.pdf", dpi=200)
```

Be specific about what you want extracted:
```python
# Good: specific extraction request
prompt = """Extract the following from this invoice:
- Vendor name and address
- Invoice number and date
- Line items with quantities and prices
- Subtotal, tax, and total
Return as JSON."""

# Less good: vague request
prompt = "What's in this document?"
```

DeepSeek-VL2 disables dynamic tiling with >2 images. For multi-image workflows:
```python
# Process separately for best quality
for page in pdf_pages:
    result = extractor.extract(page)

# Or batch with lower resolution
results = model.chat("Compare these pages", pages[:4])
```

- Use `temperature=0.0` for extraction (deterministic)
- Use `temperature=0.3` for summarization (slight creativity)
```python
# Extraction: deterministic
data = extractor.extract(doc, output_format="invoice")  # uses temp=0.0

# Q&A: allow some flexibility
answer = model.chat(question, doc, temperature=0.3)
```

```python
# Use a smaller variant
model = DeepSeekVL2(variant="tiny")  # fits a 16GB GPU

# Or reduce image resolution
from src.utils import resize_for_model
img = resize_for_model(large_image, model="deepseek", max_tokens=2048)
```

```python
# Switch to the vLLM backend
model = DeepSeekVL2(variant="small", backend="vllm")

# Or use flash attention (default with transformers)
model = DeepSeekVL2(variant="small")  # uses flash_attention_2
```

The DocumentExtractor handles JSON parsing automatically, including extracting JSON from markdown code blocks. If you're using the model directly:
```python
import json
import re

response = model.chat(prompt, image)

# Extract JSON from a fenced code block if present, else parse directly
# (`{3}` matches the three backticks of a markdown fence)
match = re.search(r"`{3}(?:json)?\s*([\s\S]*?)`{3}", response)
if match:
    data = json.loads(match.group(1))
else:
    data = json.loads(response)
```

- Batch processing: process multiple single-page documents per model load
- PDF DPI: 150 DPI balances quality and speed for most documents
- Preprocessing: deskew and denoise images before OCR
- Caching: cache model weights with the `TRANSFORMERS_CACHE` env var
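For example, the cache location can be pinned before launching (the path and script name below are placeholders):

```shell
# Cache weights on a fast local disk so repeat runs skip the download
export TRANSFORMERS_CACHE=/mnt/ssd/hf-cache
python run_extraction.py
```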
See the source code documentation:

- `src/deepseek_vl2/loader.py` - model loading
- `src/deepseek_vl2/document_ocr.py` - document extraction