
DeepSeek-VL2: Document OCR & Extraction

DeepSeek-VL2, released in December 2024, is a Mixture-of-Experts (MoE) vision-language model that posts state-of-the-art results on OCR and document-understanding benchmarks.

Why DeepSeek-VL2 for Documents

| Benchmark | DeepSeek-VL2 | GPT-4o | Improvement (relative) |
| --- | --- | --- | --- |
| OCRBench | 834 | 736 | +13% |
| DocVQA | 93.3% | 92.8% | +0.5% |
| TextVQA | 84.2% | 77.4% | +9% |
| ChartQA | 85.7% | 78.5% | +9% |

The model uses "context optical compression" - encoding entire pages as compact vision tokens instead of individual text tokens. This achieves 20x compression with ~97% accuracy retention.
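The compression claim translates into simple token arithmetic. The per-page text-token count below is an illustrative assumption, not a measured figure:

```python
# Rough token arithmetic for the ~20x compression claim above.
# TEXT_TOKENS_PER_PAGE is an illustrative assumption, not a measured figure.
TEXT_TOKENS_PER_PAGE = 2400   # assumed: a dense page of text
COMPRESSION_RATIO = 20        # ~20x, per the claim above

vision_tokens_per_page = TEXT_TOKENS_PER_PAGE // COMPRESSION_RATIO
print(vision_tokens_per_page)           # 120

# A 100-page contract fits in far fewer tokens as images than as text:
pages = 100
print(pages * TEXT_TOKENS_PER_PAGE)     # 240000 text tokens
print(pages * vision_tokens_per_page)   # 12000 vision tokens
```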

Model Variants

| Variant | Total Params | Activated | Min GPU | Best For |
| --- | --- | --- | --- | --- |
| Tiny | 3.37B | 1.0B | 16GB | Prototyping, edge |
| Small | 16.1B | 2.8B | 80GB | Production OCR |
| Full | 27.5B | 4.5B | 80GB+ | Maximum accuracy |

The MoE architecture activates only a fraction of the parameters per token (~17% for Small and Full, ~30% for Tiny), keeping inference cost well below that of a dense model of the same total size.
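The activated-parameter share falls out directly from the table's numbers:

```python
# Activated-parameter share per variant, using the figures from the table above.
variants = {
    "tiny":  (3.37, 1.0),   # (total params in B, activated params in B)
    "small": (16.1, 2.8),
    "full":  (27.5, 4.5),
}
for name, (total, active) in variants.items():
    print(f"{name}: {active / total:.0%} of parameters active per token")
# tiny: 30% ...  small: 17% ...  full: 16% ...
```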

Installation

pip install torch transformers accelerate pillow
pip install "vllm>=0.7.2"  # For production throughput (quotes keep the shell from treating >= as redirection)
pip install pymupdf pdf2image  # For PDF support

Quick Start

Basic Usage

from src.deepseek_vl2 import DeepSeekVL2

# Load model
model = DeepSeekVL2(variant="small").load()

# Extract text from image
response = model.chat(
    "Extract all text from this document",
    "invoice.png"
)
print(response)

Document Extraction Pipeline

from src.deepseek_vl2 import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor(variant="small")

# Extract structured invoice data
invoice = extractor.extract("invoice.pdf", output_format="invoice")
print(invoice)
# {
#   "vendor_name": "Acme Corp",
#   "invoice_number": "INV-2024-001",
#   "total": 1250.00,
#   "line_items": [...]
# }

# Extract tables
tables = extractor.extract_tables("report.pdf")
for table in tables:
    print(table["headers"])
    for row in table["rows"]:
        print(row)

# Custom schema extraction
schema = {
    "company": "string",
    "date": "string",
    "items": [{"name": "string", "qty": "number", "price": "number"}]
}
data = extractor.extract("purchase_order.png", schema=schema)

Multi-Page PDFs

from src.deepseek_vl2 import DocumentExtractor

with DocumentExtractor() as extractor:
    # Process all pages
    results = extractor.extract("contract.pdf", output_format="json")

    # Results is a list, one dict per page
    for i, page_data in enumerate(results):
        print(f"Page {i+1}: {page_data}")

Comparing Documents

# Find differences between two versions
diff = extractor.compare_documents("contract_v1.pdf", "contract_v2.pdf")
print(diff["summary"])
print(diff["differences"])

Production Deployment with vLLM

For high-throughput production, use the vLLM backend:

from src.deepseek_vl2 import DeepSeekVL2

# vLLM provides continuous batching and optimized CUDA kernels
model = DeepSeekVL2(
    variant="small",
    backend="vllm",  # Production backend
).load()

# Process a batch of documents, collecting one result per document
documents = ["doc1.png", "doc2.png", "doc3.png"]
results = [model.chat("Extract all text", doc) for doc in documents]

vLLM Server Mode

For API serving:

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-vl2-small \
    --trust-remote-code \
    --dtype bfloat16

# Query via API (OpenAI-compatible vision format: the image goes in the
# message content as a base64 data URI, not in a separate "images" field)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/deepseek-vl2-small",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this document"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,<base64_encoded_image>"}}
            ]
        }]
    }'
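The same request can be made from Python with only the standard library. This is a sketch assuming the server above is running locally; the endpoint URL and model name come from the server command, and the payload follows the OpenAI chat-completions vision format:

```python
import base64
import json
import urllib.request

SERVER = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server
MODEL = "deepseek-ai/deepseek-vl2-small"

def build_request(image_path: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URI."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic extraction
    }

def send(payload: dict) -> str:
    """POST the payload and return the model's reply text."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires the server above to be running):
# print(send(build_request("invoice.png", "Extract all text from this document")))
```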

Best Practices

1. Image Resolution

DeepSeek-VL2 uses dynamic tiling (384x384 tiles). For best results:

  • Keep images under 4096x4096 for speed
  • Don't downscale below 768px for small text
  • Higher DPI (150+) for PDFs with small fonts
from src.utils import load_pdf_pages

# Higher DPI for better OCR on small text
pages = load_pdf_pages("dense_document.pdf", dpi=200)

2. Prompt Engineering

Be specific about what you want extracted:

# Good: Specific extraction request
prompt = """Extract the following from this invoice:
- Vendor name and address
- Invoice number and date
- Line items with quantities and prices
- Subtotal, tax, and total

Return as JSON."""

# Less good: Vague request
prompt = "What's in this document?"

3. Handling Multiple Images

DeepSeek-VL2 disables dynamic tiling with >2 images. For multi-image:

# Process separately for best quality
for page in pdf_pages:
    result = extractor.extract(page)

# Or batch with lower resolution
results = model.chat("Compare these pages", pages[:4])

4. Temperature Settings

  • Use temperature=0.0 for extraction (deterministic)
  • Use temperature=0.3 for summarization (slight creativity)
# Extraction: deterministic
data = extractor.extract(doc, output_format="invoice")  # Uses temp=0.0

# Q&A: Allow some flexibility
answer = model.chat(question, doc, temperature=0.3)

Common Issues

Out of Memory

# Use smaller variant
model = DeepSeekVL2(variant="tiny")  # 16GB GPU

# Or reduce image resolution
from src.utils import resize_for_model
img = resize_for_model(large_image, model="deepseek", max_tokens=2048)

Slow Inference

# Switch to vLLM backend
model = DeepSeekVL2(variant="small", backend="vllm")

# Or use flash attention (default with transformers)
model = DeepSeekVL2(variant="small")  # Uses flash_attention_2

JSON Parsing Errors

The DocumentExtractor handles JSON parsing automatically, including extracting JSON from markdown code blocks. If you're using the model directly:

import json
import re

response = model.chat(prompt, image)

# Extract JSON from code blocks
match = re.search(r"```(?:json)?\s*([\s\S]*?)```", response)
if match:
    data = json.loads(match.group(1))
else:
    data = json.loads(response)

Performance Tips

  1. Batch processing: Process multiple single-page documents per model load
  2. PDF DPI: 150 DPI balances quality and speed for most documents
  3. Preprocessing: Deskew and denoise images before OCR
  4. Caching: Cache model weights with TRANSFORMERS_CACHE env var
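Tip 3 can be sketched with Pillow alone (denoising and contrast only; deskewing typically needs OpenCV or a projection-profile method). The function name is illustrative:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(img: Image.Image) -> Image.Image:
    """Grayscale, stretch contrast, and median-denoise a scan before OCR."""
    g = ImageOps.grayscale(img)
    g = ImageOps.autocontrast(g)                    # stretch faded scans to full range
    return g.filter(ImageFilter.MedianFilter(size=3))  # remove salt-and-pepper noise

# Example on a synthetic blank page
page = Image.new("RGB", (200, 200), "white")
clean = preprocess_for_ocr(page)
print(clean.mode, clean.size)  # L (200, 200)
```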

API Reference

See source code documentation:

  • src/deepseek_vl2/loader.py - Model loading
  • src/deepseek_vl2/document_ocr.py - Document extraction