Run DeepSeek-VL2 and InternVL3 on Lambda Labs or Modal. Both provide on-demand GPU access without managing infrastructure.
| Platform | Best For | Pricing | Setup Time |
|---|---|---|---|
| Lambda Labs | Interactive dev, long sessions | ~$1.10/hr (A10G) to $2.49/hr (A100) | 5 min |
| Modal | Serverless, auto-scaling, APIs | ~$0.000306/sec (A10G) | 10 min |
TL;DR: Use Lambda Labs for experimentation. Use Modal for production APIs.
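The TL;DR is mostly a utilization question: Lambda bills for every hour the instance exists, while Modal bills only for seconds of actual GPU work. A back-of-envelope sketch using the A10G prices from the tables in this guide (treat the numbers as approximate):

```python
LAMBDA_HOURLY = 1.10      # A10G on Lambda Labs, per hour the instance is up
MODAL_PER_SEC = 0.000306  # A10G on Modal, per second of GPU use

def lambda_cost(hours_up: float) -> float:
    """Lambda bills for wall-clock time the instance exists."""
    return hours_up * LAMBDA_HOURLY

def modal_cost(busy_seconds: float) -> float:
    """Modal bills only for seconds your function is actually running."""
    return busy_seconds * MODAL_PER_SEC

# 1000 images/day at ~5 s each = 5000 busy seconds
print(round(modal_cost(5000), 2))   # → 1.53
print(round(lambda_cost(24), 2))    # → 26.4
```

At the same effective hourly rate, an always-on Lambda box only wins when it is close to fully utilized; for bursty workloads Modal's per-second billing is far cheaper.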
Lambda Labs provides on-demand GPU instances. You SSH in, run your code, and pay by the hour.
- Go to lambdalabs.com/cloud
- Sign up / log in
- Click "Launch Instance"
- Select GPU:
| Model | GPU | Instance Type | Cost/hr |
|---|---|---|---|
| InternVL3-8B | A10G (24GB) | 1x A10G | ~$1.10 |
| DeepSeek-VL2-Tiny | A10G (24GB) | 1x A10G | ~$1.10 |
| DeepSeek-VL2-Small | A100 (80GB) | 1x A100 | ~$2.49 |
| InternVL3-38B | A100 (80GB) | 2x A100 | ~$4.98 |
- Add your SSH key
- Launch and wait ~2 minutes
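If you don't already have an SSH key to add, one way to generate one (the file name here is arbitrary, not anything Lambda requires):

```shell
# Create ~/.ssh if needed, then generate an ed25519 key pair with no passphrase
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -f ~/.ssh/lambda_labs -N "" -C "lambda-labs"

# Print the public half; paste it into the Lambda console's SSH keys page
cat ~/.ssh/lambda_labs.pub
```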
```bash
# SSH into your instance
ssh ubuntu@<instance-ip>

# Clone the repo
git clone https://github.com/YOUR_USERNAME/vision-reasoning-deepseek.git
cd vision-reasoning-deepseek

# Install dependencies (Lambda instances have PyTorch pre-installed)
pip install -r requirements.txt
```

```bash
# Test with a sample image
python -c "
from src.internvl3 import InternVL3
model = InternVL3(variant='8B').load()
print(model.chat('Describe this image', 'https://picsum.photos/800/600'))
"

# Run document extraction
python examples/extract_invoice.py your_invoice.pdf

# Run video analysis
python examples/detect_anomaly.py your_video.mp4 --context 'Normal conditions'
```

```bash
# Upload files to the instance
scp invoice.pdf ubuntu@<instance-ip>:~/vision-reasoning-deepseek/

# Download results
scp ubuntu@<instance-ip>:~/vision-reasoning-deepseek/results.json ./
```

- Persistent storage: Use `/home/ubuntu/` - it persists across reboots
- Spot instances: Not available, but on-demand instances are reliable
- Stop vs. terminate: You can stop instances to save costs (you still pay for storage)
- Pre-download models: Models download on first run (~30GB for InternVL3-8B)

```bash
# Pre-download to avoid a timeout during a demo
python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-8B', trust_remote_code=True)"
```

Modal runs your code serverlessly: you define functions, and Modal handles GPUs, scaling, and cold starts.
```bash
pip install modal
modal setup  # Creates an account and authenticates
```

Create `modal_app.py` in your project root:
```python
import modal

app = modal.App("vision-reasoning")

# Define the container image with dependencies
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "pillow",
        "vllm>=0.7.2",
        "lmdeploy>=0.6.0",
        "opencv-python-headless",
        "decord",
        "pymupdf",
    )
    .run_commands("pip install flash-attn --no-build-isolation")
)

# Pre-download model weights into the image
def download_models():
    from transformers import AutoModel, AutoTokenizer
    AutoModel.from_pretrained("OpenGVLab/InternVL3-8B", trust_remote_code=True)
    AutoTokenizer.from_pretrained("OpenGVLab/InternVL3-8B", trust_remote_code=True)

image = image.run_function(download_models)

@app.cls(
    image=image,
    gpu="A10G",  # Or "A100" for larger models
    timeout=600,
    container_idle_timeout=300,  # Keep warm for 5 min
)
class VisionModel:
    @modal.enter()
    def load_model(self):
        """Load the model when the container starts."""
        from src.internvl3 import InternVL3
        self.model = InternVL3(variant="8B").load()

    @modal.method()
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str:
        """Analyze an image with a prompt."""
        import io
        from PIL import Image
        img = Image.open(io.BytesIO(image_bytes))
        return self.model.chat(prompt, img)

    @modal.method()
    def extract_document(self, pdf_bytes: bytes) -> dict:
        """Extract structured data from a document."""
        import tempfile
        from src.deepseek_vl2 import DocumentExtractor
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
            f.write(pdf_bytes)
            f.flush()
            extractor = DocumentExtractor(variant="tiny")
            return extractor.extract(f.name, output_format="json")

# Web endpoint for easy access
@app.function(image=image, gpu="A10G", timeout=300)
@modal.web_endpoint(method="POST")
def analyze(image_url: str, prompt: str = "Describe this image"):
    """HTTP endpoint for image analysis."""
    import urllib.request

    # Download the image
    with urllib.request.urlopen(image_url) as response:
        image_bytes = response.read()

    # Run the analysis
    vision = VisionModel()
    return {"result": vision.analyze_image.remote(image_bytes, prompt)}

# CLI entry point
@app.local_entrypoint()
def main(image_path: str, prompt: str = "Describe this image"):
    """Run from the command line."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    vision = VisionModel()
    result = vision.analyze_image.remote(image_bytes, prompt)
    print(result)
```

```bash
# Test locally (runs on Modal's cloud)
modal run modal_app.py --image-path invoice.png --prompt "Extract all text"

# Deploy as a persistent endpoint
modal deploy modal_app.py

# Call the web endpoint
curl -X POST "https://YOUR_USERNAME--vision-reasoning-analyze.modal.run" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/image.png", "prompt": "Describe this"}'
```

For video processing, add this to `modal_app.py`:
```python
@app.function(
    image=image,
    gpu="A10G",
    timeout=1800,  # 30 min for long videos
)
def analyze_video(video_bytes: bytes, context: str = "") -> list:
    """Detect anomalies in a video."""
    import tempfile
    from src.internvl3 import VideoAnalyzer
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(video_bytes)
        f.flush()
        analyzer = VideoAnalyzer(variant="8B")
        return analyzer.detect_anomalies(
            f.name,
            context=context,
            check_interval=2.0,
        )

# Use it
@app.local_entrypoint()
def process_video(video_path: str, context: str = ""):
    with open(video_path, "rb") as f:
        video_bytes = f.read()
    anomalies = analyze_video.remote(video_bytes, context)
    for a in anomalies:
        print(f"[{a['timestamp']:.1f}s] {a['description']}")
```

```bash
modal run modal_app.py::process_video --video-path factory.mp4 --context "Normal: boxes on belt"
```

Modal charges per second of GPU usage:
| GPU | Per Second | Per Hour | Per 1000 Images (~5s each) |
|---|---|---|---|
| A10G | $0.000306 | $1.10 | $1.53 |
| A100 40GB | $0.001036 | $3.73 | $5.18 |
| A100 80GB | $0.001380 | $4.97 | $6.90 |
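The per-hour and per-1000-image columns follow directly from the per-second rates; a quick sanity check, assuming ~5 s of GPU time per image as in the table header:

```python
RATES = {                 # $/second, from the table above
    "A10G": 0.000306,
    "A100 40GB": 0.001036,
    "A100 80GB": 0.001380,
}

for gpu, per_sec in RATES.items():
    per_hour = per_sec * 3600
    per_1000_images = per_sec * 5 * 1000  # ~5 s of GPU time per image
    print(f"{gpu}: ${per_hour:.2f}/hr, ${per_1000_images:.2f}/1000 images")
```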
Cold starts add ~30-60s to the first request. Use `container_idle_timeout` to keep containers warm.
- Cache model weights: Use `image.run_function()` to bake weights into the container image
- Warm containers: Set `container_idle_timeout=300` to avoid cold starts
- Batch requests: Process multiple images per container invocation
- Secrets: Use `modal.Secret` for API keys

```python
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def with_secrets():
    import os
    api_key = os.environ["API_KEY"]
```

| Scenario | Use |
|---|---|
| Experimenting with models | Lambda Labs |
| Processing a batch of documents once | Lambda Labs |
| Building a production API | Modal |
| Auto-scaling based on demand | Modal |
| Long-running training jobs | Lambda Labs |
| Pay-per-request pricing | Modal |
| Need persistent filesystem | Lambda Labs |
| Team collaboration | Either (Modal has better sharing) |
If you hit out-of-memory errors:

```python
# Use a smaller model variant
model = InternVL3(variant="2B")  # Instead of 8B

# Or load in half precision to reduce memory use
model = InternVL3(variant="8B", torch_dtype=torch.float16)
```

If the first run stalls on model downloads:

```bash
# Pre-download weights before running
python -c "
from huggingface_hub import snapshot_download
snapshot_download('OpenGVLab/InternVL3-8B')
"
```

If Modal cold starts are slowing you down:

```python
# Increase the idle timeout
@app.cls(container_idle_timeout=600)  # 10 minutes

# Or use modal.Cls.lookup() to keep containers warm
# by calling them periodically
```

If Lambda Labs has no instances available:

- Check availability in your region
- Try a different GPU type
- Instances are first-come-first-served
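Since capacity is first-come-first-served, a retry loop with backoff can grab an instance as soon as one frees up. A generic sketch; `try_launch` here is a placeholder for whatever launch call or availability check you use, not a Lambda API:

```python
import random
import time

def retry_with_backoff(try_launch, max_tries=8, base_delay=2.0):
    """Call try_launch() until it returns a truthy value.

    Waits base_delay * 2**attempt seconds (with jitter) between failed
    attempts, so the defaults poll for several minutes before giving up.
    """
    for attempt in range(max_tries):
        result = try_launch()
        if result:
            return result
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    raise RuntimeError(f"no capacity after {max_tries} tries")
```

Exponential backoff with jitter avoids hammering the API while still catching capacity quickly when it appears.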