Run DeepSeek-VL2 and InternVL3 on Lambda Labs or Modal. Both provide on-demand GPU access without managing infrastructure.
| Platform | Best For | Pricing | Setup Time |
|---|---|---|---|
| Lambda Labs | Interactive dev, long sessions | ~$1.10/hr (A10G) to $2.49/hr (A100) | 5 min |
| Modal | Serverless, auto-scaling, APIs | ~$0.000306/sec (A10G) | 10 min |
TL;DR: Use Lambda Labs for experimentation. Use Modal for production APIs.
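The TL;DR is mostly a utilization question: Lambda bills for every hour the instance exists, while Modal bills only for seconds of actual GPU work. A back-of-envelope sketch using the A10G prices from the tables in this guide (treat the numbers as approximate):

```python
LAMBDA_HOURLY = 1.10      # A10G on Lambda Labs, per hour the instance is up
MODAL_PER_SEC = 0.000306  # A10G on Modal, per second of GPU use

def lambda_cost(hours_up: float) -> float:
    """Lambda bills for wall-clock time the instance exists."""
    return hours_up * LAMBDA_HOURLY

def modal_cost(busy_seconds: float) -> float:
    """Modal bills only for seconds your function is actually running."""
    return busy_seconds * MODAL_PER_SEC

# 1000 images/day at ~5 s each = 5000 busy seconds
print(round(modal_cost(5000), 2))   # → 1.53
print(round(lambda_cost(24), 2))    # → 26.4
```

At the same effective hourly rate, an always-on Lambda box only wins when it is close to fully utilized; for bursty workloads Modal's per-second billing is far cheaper.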
Lambda Labs provides on-demand GPU instances. You SSH in, run your code, and pay by the hour.
- Go to lambdalabs.com/cloud
- Sign up / log in
- Click "Launch Instance"
- Select GPU:
| Model | GPU | Instance Type | Cost/hr |
|---|---|---|---|
| InternVL3-8B | A10G (24GB) | 1x A10G | ~$1.10 |
| DeepSeek-VL2-Tiny | A10G (24GB) | 1x A10G | ~$1.10 |
| DeepSeek-VL2-Small | A100 (80GB) | 1x A100 | ~$2.49 |
| InternVL3-38B | A100 (80GB) | 2x A100 | ~$4.98 |
- Add your SSH key
- Launch and wait ~2 minutes
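If you don't already have an SSH key to add, one way to generate one (the file name here is arbitrary, not anything Lambda requires):

```shell
# Create ~/.ssh if needed, then generate an ed25519 key pair with no passphrase
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -f ~/.ssh/lambda_labs -N "" -C "lambda-labs"

# Print the public half; paste it into the Lambda console's SSH keys page
cat ~/.ssh/lambda_labs.pub
```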
```bash
# SSH into your instance
ssh ubuntu@<instance-ip>

# Clone the repo
git clone https://github.com/YOUR_USERNAME/vision-reasoning-deepseek.git
cd vision-reasoning-deepseek

# Install dependencies (Lambda instances have PyTorch pre-installed)
pip install -r requirements.txt
```

```bash
# Test with a sample image
python -c "
from src.internvl3 import InternVL3
model = InternVL3(variant='8B').load()
print(model.chat('Describe this image', 'https://picsum.photos/800/600'))
"

# Run document extraction
python examples/extract_invoice.py your_invoice.pdf

# Run video analysis
python examples/detect_anomaly.py your_video.mp4 --context 'Normal conditions'
```

```bash
# Upload files to the instance
scp invoice.pdf ubuntu@<instance-ip>:~/vision-reasoning-deepseek/

# Download results
scp ubuntu@<instance-ip>:~/vision-reasoning-deepseek/results.json ./
```

- Persistent storage: Use `/home/ubuntu/` - it persists across reboots
- Spot instances: Not available, but on-demand instances are reliable
- Stop vs. terminate: You can stop instances to save costs (you still pay for storage)
- Pre-download models: Models download on first run (~30GB for InternVL3-8B)

```bash
# Pre-download to avoid a timeout during a demo
python -c "from transformers import AutoModel; AutoModel.from_pretrained('OpenGVLab/InternVL3-8B', trust_remote_code=True)"
```

Modal runs your code serverlessly: you define functions, and Modal handles GPUs, scaling, and cold starts.
```bash
pip install modal
modal setup  # Creates an account and authenticates
```

Create `modal_app.py` in your project root:
```python
import modal

app = modal.App("vision-reasoning")

# Define the container image with dependencies
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "pillow",
        "vllm>=0.7.2",
        "lmdeploy>=0.6.0",
        "opencv-python-headless",
        "decord",
        "pymupdf",
    )
    .run_commands("pip install flash-attn --no-build-isolation")
)

# Pre-download model weights into the image
def download_models():
    from transformers import AutoModel, AutoTokenizer
    AutoModel.from_pretrained("OpenGVLab/InternVL3-8B", trust_remote_code=True)
    AutoTokenizer.from_pretrained("OpenGVLab/InternVL3-8B", trust_remote_code=True)

image = image.run_function(download_models)

@app.cls(
    image=image,
    gpu="A10G",  # Or "A100" for larger models
    timeout=600,
    container_idle_timeout=300,  # Keep warm for 5 min
)
class VisionModel:
    @modal.enter()
    def load_model(self):
        """Load the model when the container starts."""
        from src.internvl3 import InternVL3
        self.model = InternVL3(variant="8B").load()

    @modal.method()
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str:
        """Analyze an image with a prompt."""
        import io
        from PIL import Image
        img = Image.open(io.BytesIO(image_bytes))
        return self.model.chat(prompt, img)

    @modal.method()
    def extract_document(self, pdf_bytes: bytes) -> dict:
        """Extract structured data from a document."""
        import tempfile
        from src.deepseek_vl2 import DocumentExtractor
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
            f.write(pdf_bytes)
            f.flush()
            extractor = DocumentExtractor(variant="tiny")
            return extractor.extract(f.name, output_format="json")

# Web endpoint for easy access
@app.function(image=image, gpu="A10G", timeout=300)
@modal.web_endpoint(method="POST")
def analyze(image_url: str, prompt: str = "Describe this image"):
    """HTTP endpoint for image analysis."""
    import urllib.request

    # Download the image
    with urllib.request.urlopen(image_url) as response:
        image_bytes = response.read()

    # Run the analysis
    vision = VisionModel()
    return {"result": vision.analyze_image.remote(image_bytes, prompt)}

# CLI entry point
@app.local_entrypoint()
def main(image_path: str, prompt: str = "Describe this image"):
    """Run from the command line."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    vision = VisionModel()
    result = vision.analyze_image.remote(image_bytes, prompt)
    print(result)
```

```bash
# Test locally (runs on Modal's cloud)
modal run modal_app.py --image-path invoice.png --prompt "Extract all text"

# Deploy as a persistent endpoint
modal deploy modal_app.py

# Call the web endpoint
curl -X POST "https://YOUR_USERNAME--vision-reasoning-analyze.modal.run" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/image.png", "prompt": "Describe this"}'
```

For video processing, add this to `modal_app.py`:
```python
@app.function(
    image=image,
    gpu="A10G",
    timeout=1800,  # 30 min for long videos
)
def analyze_video(video_bytes: bytes, context: str = "") -> list:
    """Detect anomalies in a video."""
    import tempfile
    from src.internvl3 import VideoAnalyzer
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(video_bytes)
        f.flush()
        analyzer = VideoAnalyzer(variant="8B")
        return analyzer.detect_anomalies(
            f.name,
            context=context,
            check_interval=2.0,
        )

# Use it
@app.local_entrypoint()
def process_video(video_path: str, context: str = ""):
    with open(video_path, "rb") as f:
        video_bytes = f.read()
    anomalies = analyze_video.remote(video_bytes, context)
    for a in anomalies:
        print(f"[{a['timestamp']:.1f}s] {a['description']}")
```

```bash
modal run modal_app.py::process_video --video-path factory.mp4 --context "Normal: boxes on belt"
```

Modal charges per second of GPU usage:
| GPU | Per Second | Per Hour | Per 1000 Images (~5s each) |
|---|---|---|---|
| A10G | $0.000306 | $1.10 | $1.53 |
| A100 40GB | $0.001036 | $3.73 | $5.18 |
| A100 80GB | $0.001380 | $4.97 | $6.90 |
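The per-hour and per-1000-image columns follow directly from the per-second rates; a quick sanity check, assuming ~5 s of GPU time per image as in the table header:

```python
RATES = {                 # $/second, from the table above
    "A10G": 0.000306,
    "A100 40GB": 0.001036,
    "A100 80GB": 0.001380,
}

for gpu, per_sec in RATES.items():
    per_hour = per_sec * 3600
    per_1000_images = per_sec * 5 * 1000  # ~5 s of GPU time per image
    print(f"{gpu}: ${per_hour:.2f}/hr, ${per_1000_images:.2f}/1000 images")
```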
Cold starts add ~30-60s to the first request. Use `container_idle_timeout` to keep containers warm.
- Cache model weights: Use `image.run_function()` to bake weights into the container image
- Warm containers: Set `container_idle_timeout=300` to avoid cold starts
- Batch requests: Process multiple images per container invocation
- Secrets: Use `modal.Secret` for API keys

```python
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def with_secrets():
    import os
    api_key = os.environ["API_KEY"]
```

| Scenario | Use |
|---|---|
| Experimenting with models | Lambda Labs |
| Processing a batch of documents once | Lambda Labs |
| Building a production API | Modal |
| Auto-scaling based on demand | Modal |
| Long-running training jobs | Lambda Labs |
| Pay-per-request pricing | Modal |
| Need persistent filesystem | Lambda Labs |
| Team collaboration | Either (Modal has better sharing) |
If you hit out-of-memory errors:

```python
# Use a smaller model variant
model = InternVL3(variant="2B")  # Instead of 8B

# Or load in half precision to reduce memory use
model = InternVL3(variant="8B", torch_dtype=torch.float16)
```

If the first run stalls on model downloads:

```bash
# Pre-download weights before running
python -c "
from huggingface_hub import snapshot_download
snapshot_download('OpenGVLab/InternVL3-8B')
"
```

If Modal cold starts are slowing you down:

```python
# Increase the idle timeout
@app.cls(container_idle_timeout=600)  # 10 minutes

# Or use modal.Cls.lookup() to keep containers warm
# by calling them periodically
```

If Lambda Labs has no instances available:

- Check availability in your region
- Try a different GPU type
- Instances are first-come-first-served
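Since capacity is first-come-first-served, a retry loop with backoff can grab an instance as soon as one frees up. A generic sketch; `try_launch` here is a placeholder for whatever launch call or availability check you use, not a Lambda API:

```python
import random
import time

def retry_with_backoff(try_launch, max_tries=8, base_delay=2.0):
    """Call try_launch() until it returns a truthy value.

    Waits base_delay * 2**attempt seconds (with jitter) between failed
    attempts, so the defaults poll for several minutes before giving up.
    """
    for attempt in range(max_tries):
        result = try_launch()
        if result:
            return result
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    raise RuntimeError(f"no capacity after {max_tries} tries")
```

Exponential backoff with jitter avoids hammering the API while still catching capacity quickly when it appears.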