Custom fork of CVAT with YOLOE Visual Prompt and SAM3 (Segment Anything Model 3) integration for AI-assisted annotation.
| Model | Description | Capabilities |
|---|---|---|
| YOLOE Visual Prompt | Detection by visual examples | Rectangle, OBB (rotated), Polygon (segmentation) |
| SAM3 | Text-prompted segmentation | Text-to-Segment, Text-to-Detect, Text-to-Track |

- Docker and Docker Compose
- NVIDIA GPU with CUDA 12.4+ (minimum 8GB VRAM)
- nuctl v1.13.0 (Nuclio CLI)
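
Before installing anything, it can help to see which of the command-line prerequisites above are already present. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and using `nvidia-smi` as a stand-in for "an NVIDIA driver is installed" is an assumption. It does not verify versions, Docker Compose, CUDA, or VRAM.

```shell
#!/bin/sh
# Hypothetical helper: print every tool name from the arguments
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Check the commands this README relies on.
# nvidia-smi is used here only as a rough proxy for the NVIDIA driver.
missing=$(missing_tools docker nuctl nvidia-smi)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:"
  echo "$missing"
else
  echo "docker, nuctl and nvidia-smi are all on PATH"
fi
```

An empty result only means the commands exist; the nuctl version and GPU memory still need to be checked by hand.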
# Install nuctl
wget https://github.com/nuclio/nuclio/releases/download/1.13.0/nuctl-1.13.0-linux-amd64
chmod +x nuctl-1.13.0-linux-amd64
sudo mv nuctl-1.13.0-linux-amd64 /usr/local/bin/nuctl

SAM3 requires access to the model on HuggingFace:
# Install HuggingFace CLI
curl -LsSf https://hf.co/cli/install.sh | bash
# Login and download model (requires approval at https://huggingface.co/facebook/sam3)
huggingface-cli login
huggingface-cli download facebook/sam3

# Clone repository
git clone https://github.com/mvaldi/cvat-yoloe-sam.git
cd cvat-yoloe-sam
# Start CVAT with all models
./zup.sh
# Or YOLOE only (without SAM3)
./zup.sh --no-sam3
# Or SAM3 only (without YOLOE)
./zup.sh --no-yoloe
# Base CVAT only (no AI models)
./zup.sh --no-sam3 --no-yoloe

Access CVAT at: http://localhost:8080
# Or start with an explicit host (here, the machine's first IP address)
./zup.sh --host $(hostname -I | awk '{print $1}')

# Stop containers
./zdown.sh
# Stop and clean Nuclio functions
./zdown.sh --clean

- Create a Task and upload images/video
- Manually annotate some reference frames (minimum 1)
- Go to AI Tools → YOLOE
- Select reference frames and click Generate VPE
- Navigate to an unannotated frame
- Select Output Type: Rectangle | OBB | Polygon
- Adjust Confidence and click Detect
- Review and apply detections
- Go to AI Tools → SAM3
- Enter a text prompt (e.g., "person", "car", "dog")
- Select a mode:
  - Segment: segment a specific object
  - Detect: detect all instances
  - Track: track an object through a video
- Adjust confidence and apply results

| Configuration | Required VRAM |
|---|---|
| YOLOE only | ~4 GB |
| SAM3 only | ~6 GB |
| YOLOE + SAM3 | ~10 GB |
Note: On GPUs with less than 12 GB of VRAM, run only one model at a time.
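
The VRAM guidance above can be turned into a small helper that suggests which `./zup.sh` flags to use. This is a sketch under stated assumptions, not something shipped with the repository: `suggest_zup_flags` is a hypothetical function, the 12000 MiB cutoff is an approximation of the "12 GB" note (cards marketed as 12 GB report roughly that many MiB), and defaulting to YOLOE when only one model fits is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print suggested ./zup.sh flags.
#   $1: total GPU memory in MiB
#   $2: model to keep when only one fits: "yoloe" (default) or "sam3"
suggest_zup_flags() {
  total_mib=$1
  prefer=${2:-yoloe}
  if [ "$total_mib" -ge 12000 ]; then
    echo ""              # enough memory for both models: no flags
  elif [ "$prefer" = "sam3" ]; then
    echo "--no-yoloe"    # keep SAM3 only
  else
    echo "--no-sam3"     # keep YOLOE only
  fi
}

# Only query the GPU if nvidia-smi is available; print the suggestion
# instead of starting anything.
if command -v nvidia-smi >/dev/null 2>&1; then
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)
  if [ -n "$total" ]; then
    echo "Suggested command: ./zup.sh $(suggest_zup_flags "$total")"
  fi
fi
```

On a multi-GPU machine this reads only the first GPU reported by `nvidia-smi`, and it does not account for memory already in use by other processes.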
The first run of ./zup.sh downloads the models and builds the Docker images, which may take 10-30 minutes depending on your connection.
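
Because the first start can take a while, a script that depends on CVAT may want to poll until the web UI responds. The sketch below is an illustration only: `wait_until` is a hypothetical helper, and whether `./zup.sh` itself already blocks until everything is ready is not stated here, so treat the polling as optional.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds.
#   $1: maximum number of attempts
#   $2: pause in seconds after each failed attempt
#   remaining arguments: the command to run
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (commented out so that sourcing this file starts nothing):
# poll the default CVAT address for up to about 30 minutes.
# wait_until 180 10 curl -fsS -o /dev/null http://localhost:8080 && echo "CVAT is up"
```

The 180 attempts at 10-second intervals are chosen to match the upper end of the 10-30 minute estimate; adjust them for your connection.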
# View server logs
docker logs cvat_server --tail 50
# View YOLOE logs
docker logs nuclio-nuclio-pth-ultralytics-yoloe-visual-prompt --tail 50
# View SAM3 logs
docker logs nuclio-nuclio-pth-facebookresearch-sam3-gpu --tail 50
# Check Nuclio functions
nuctl get function --platform local

For complete CVAT documentation (formats, API, SDK, CLI):
MIT License - See LICENSE for details.
This project includes models with additional licenses:
- YOLOE: Ultralytics License
- SAM3: Meta AI License
