Run MiniMax M2.5 (a 230B-parameter MoE with 10B active parameters) locally on an NVIDIA DGX Spark, served through an OpenAI-compatible API.
```bash
# 1. Download model (~101GB)
huggingface-cli download unsloth/MiniMax-M2.5-GGUF \
  --local-dir ./models --include '*UD-Q3_K_XL*'

# 2. Build and start
cd docker
docker compose build   # First time only
docker compose up -d

# 3. Test
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "minimax-m2.5", "messages": [{"role": "user", "content": "Hello"}]}'
```
```bash
docker compose up -d    # Start
docker compose down     # Stop
docker compose logs -f  # Logs
docker compose ps       # Status
```

Target: NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory)
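Because the ~101GB model takes a while to load, `docker compose up -d` returns long before the server can answer requests. A small poll against the `/health` endpoint (the same one used for troubleshooting below) makes a convenient readiness gate:

```bash
# Block until the server reports healthy; -f makes curl fail on
# non-2xx responses, so the loop keeps waiting while the model loads
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "still loading..."
  sleep 10
done
echo "server is ready"
```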
Key settings in `docker-compose.yml`:

| Setting | Value | Purpose |
|---|---|---|
| `-ngl 999` | All layers on GPU | Full GPU acceleration |
| `-c 131072` | 128K context | Large context window |
| `-fa on` | Flash Attention | Memory efficiency |
| `--temp 1.0` | MiniMax default | Recommended sampling |
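These flags match llama.cpp's `llama-server` CLI, which is presumably what the container runs. For orientation, a sketch of an equivalent bare-metal invocation under that assumption (the `.gguf` filename and the `-a` alias are illustrative placeholders, not taken from the repo):

```bash
# Hypothetical equivalent of the compose service's command; the model
# filename and alias below are placeholders, not from docker-compose.yml.
llama-server \
  -m ./models/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00002.gguf \
  -ngl 999 -c 131072 -fa on --temp 1.0 \
  --host 0.0.0.0 --port 8080 \
  -a minimax-m2.5
```

The `-a` alias would let the API's `"model": "minimax-m2.5"` field match without clients needing the full file name.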
If the server doesn't respond:

```bash
docker compose logs                # Check errors
ls -lh ../models/UD-Q3_K_XL/       # Verify model exists
curl http://localhost:8080/health  # Health check
```

Model:

- MiniMax M2.5 UD-Q3_K_XL via Unsloth
- 230B total params, 10B active (MoE), 200K context
- 80.2% SWE-Bench Verified
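Since the endpoint is OpenAI-compatible, the usual request fields apply beyond the hello-world test above. For example, streaming a response with the recommended sampling temperature (the field names are standard OpenAI chat-completion parameters, not verified against this particular server build):

```bash
# -N disables curl's output buffering so streamed tokens appear as they arrive
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimax-m2.5",
        "temperature": 1.0,
        "stream": true,
        "messages": [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
      }'
```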