Junfu Pu*,
Yuxin Chen*,
Teng Wang*,
Ying Shan
ARC Lab, Tencent
*Equal Contribution
Note: Our code and model weights are currently undergoing internal open-source review. They will be publicly released once the review is complete.
- [2026/04/24] The Online Demo and API Service are now available.
- [2026/04/22] Released the technical report, project page, and examples.
OmniScript is an 8B-parameter omni-modal (audio + visual) language model built on Qwen3VL-8B, designed for the Video-to-Script (V2S) task: converting long-form cinematic videos into hierarchical, scene-by-scene structured screenplays.
Given a video (up to 5 minutes natively, or longer via two-stage generation), OmniScript produces:
- Meta - title, character list
- Scene-level script - location, environment, time, mood for each scene
- Event-level script - timestamped character actions, dialogues, expressions, audio cues, and subtext
- Chain-of-Thought reasoning - plot summary and character relationship analysis before structured output
The model is trained via a progressive pipeline: modality alignment (1M videos) → multimodal pretraining (2.4M videos) → CoT supervised fine-tuning (45K videos) → reinforcement learning with temporally segmented rewards (GRPO). Despite its 8B parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to Gemini 3-Pro on both temporal localization and multi-field semantic accuracy.
| Model | Params | Omni | Char. | Dia. | Act. | Exp. | Aud. | Overall | tIoU@0.1 |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| Gemini-3-flash | - | Yes | 28.8 | 50.3 | 28.2 | 25.5 | 11.2 | 28.8 | 44.3 |
| Gemini-3-pro | - | Yes | 39.8 | 68.8 | 37.4 | 35.4 | 13.3 | 38.9 | 64.4 |
| Gemini-2.5-flash | - | Yes | 40.1 | 75.5 | 42.8 | 36.5 | 22.8 | 43.6 | 74.3 |
| Gemini-2.5-pro | - | Yes | 41.7 | 75.0 | 41.9 | 39.0 | 17.0 | 42.9 | 73.4 |
| Seed-1.8 | - | No | 40.9 | 54.4 | 35.1 | 29.6 | 12.4 | 34.5 | 50.7 |
| Seed-2.0-pro | - | No | 47.4 | 68.1 | 42.9 | 35.7 | 10.3 | 40.9 | 67.1 |
| **Open-source Models** | | | | | | | | | |
| Qwen3VL | 8B | No | 30.4 | 49.6 | 26.9 | 25.3 | 6.6 | 27.7 | 47.6 |
| Qwen3VL | 32B | No | 37.1 | 57.1 | 31.3 | 28.7 | 7.2 | 32.3 | 52.5 |
| Qwen3VL | 235B/22B | No | 38.1 | 58.6 | 33.0 | 29.1 | 6.0 | 33.0 | 62.0 |
| TimeChat-Captioner | 8B | Yes | 6.9 | 6.9 | 6.6 | 9.7 | 8.2 | 7.7 | 16.1 |
| MiniCPM-O-4.5 | 9B | Yes | 3.1 | 8.0 | 2.9 | 3.2 | 2.4 | 3.9 | 3.2 |
| OmniScript (Ours) | 8B | Yes | 39.2 | 72.2 | 33.7 | 31.9 | 11.6 | 37.7 | 69.3 |
| Model | Params | Omni | Loc. | Type | Env. | Time | Mood | Overall | tIoU@0.1 |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| Gemini-3-flash | - | Yes | 54.6 | 59.8 | 42.7 | 54.9 | 50.4 | 52.5 | 70.3 |
| Gemini-3-pro | - | Yes | 58.8 | 63.1 | 46.9 | 61.6 | 54.8 | 57.0 | 75.3 |
| Gemini-2.5-flash | - | Yes | 52.8 | 57.1 | 45.7 | 56.1 | 50.3 | 52.3 | 69.6 |
| Gemini-2.5-pro | - | Yes | 56.6 | 62.4 | 50.8 | 60.1 | 54.6 | 56.9 | 74.1 |
| Seed-1.8 | - | No | 57.9 | 58.6 | 47.7 | 58.7 | 52.8 | 55.1 | 74.0 |
| Seed-2.0-pro | - | No | 57.7 | 62.2 | 49.2 | 62.7 | 54.3 | 57.2 | 75.5 |
| **Open-source Models** | | | | | | | | | |
| Qwen3-Omni | 30B/3B | Yes | 18.4 | 26.0 | 14.6 | 23.6 | 22.4 | 21.0 | 29.6 |
| Qwen3VL | 8B | No | 41.3 | 49.7 | 31.8 | 39.8 | 41.7 | 40.9 | 60.6 |
| Qwen3VL | 32B | No | 50.4 | 58.7 | 42.7 | 55.4 | 47.9 | 51.0 | 71.1 |
| Qwen3VL | 235B/22B | No | 52.6 | 60.2 | 45.4 | 57.9 | 50.9 | 53.4 | 72.8 |
| TimeChat-Captioner | 8B | Yes | 19.9 | 29.5 | 17.3 | 28.8 | 30.5 | 25.2 | 46.6 |
| MiniCPM-O-4.5 | 9B | Yes | 10.3 | 22.0 | 8.4 | 17.7 | 17.4 | 15.1 | 32.0 |
| OmniScript (Ours) | 8B | Yes | 54.0 | 58.4 | 41.9 | 58.1 | 49.5 | 52.4 | 74.6 |
With only 8B parameters, OmniScript outperforms all open-source models (including Qwen3VL-235B) and achieves performance comparable to proprietary models like Gemini-3-Pro on both event-level and scene-level metrics.
We provide an Online OmniScript Demo hosted on the ARC Lab website where you can upload a video and experience OmniScript directly - no local setup required.
How to find the demo on the ARC Lab homepage:
ARC Lab -> AI Demo -> Register with Phone No. -> Multimodal Comprehension and Generation -> ARC-OmniScript
We provide model access via an API service. A brief tutorial on using the API follows; for more details, please refer to the documentation.
Before using the OmniScript API, you must obtain an access token (ARC-Token). Users who are not logged in must complete account verification first.
Steps to get your token:
- Log in: Visit ARC Website and log in with your mobile number.
- Retrieve Token: Once logged in, click the user icon in the top-right corner and select "View Token" from the dropdown menu to get your ARC-Token.
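As a quick sketch of how an API call might be assembled with the ARC-Token, using only the standard library. The endpoint URL and request field names below are placeholders, not the real API contract; consult the official documentation for the actual values.

```python
import json

# Placeholder endpoint -- see the API documentation for the real URL.
API_URL = "https://<arc-api-endpoint>/v2s"

def build_request(arc_token: str, video_url: str) -> tuple[dict, bytes]:
    """Assemble headers and a JSON body for a hypothetical V2S request.

    The ARC-Token obtained via the steps above is sent as a bearer token;
    the 'video_url' field name is an assumption for illustration only.
    """
    headers = {
        "Authorization": f"Bearer {arc_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"video_url": video_url}).encode("utf-8")
    return headers, body

# To actually send it (requires network access and a valid token):
# import urllib.request
# headers, body = build_request("YOUR_ARC_TOKEN", "https://example.com/clip.mp4")
# req = urllib.request.Request(API_URL, data=body, headers=headers)
# with urllib.request.urlopen(req) as resp:
#     script = json.loads(resp.read())
```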
- Python >= 3.10
- CUDA-capable GPU (16 GB+ VRAM recommended)
- Conda
```bash
source setup_env.sh
```

This will create a conda environment `omniscript` and install all dependencies (PyTorch, FFmpeg, Flash Attention, etc.).
A Flask-based interface for uploading videos and viewing structured screenplay results with interactive timestamp navigation.
```bash
python demo.py \
    --model_path /path/to/model \
    --whisper_model_path /path/to/whisper-large-v3 \
    --port 8080
```

| Argument | Default | Description |
|---|---|---|
| `--model_path` | (required) | Path to the pretrained model checkpoint |
| `--whisper_model_path` | None | Path to Whisper model for audio processing |
| `--port` | 8080 | Port for the Flask server |
| `--host` | 0.0.0.0 | Listening address |
| `--debug` | False | Use built-in sample data, no GPU needed |
> If `--whisper_model_path` is not specified, `openai/whisper-large-v3` will be downloaded automatically from HuggingFace.
To quickly preview the UI without a GPU (use `examples/debug.mp4` as the test video):
```bash
python demo.py --debug --port 8080
```

```bash
python inference.py \
    --model_path /path/to/model \
    --whisper_model_path /path/to/whisper-large-v3 \
    --video_path /path/to/video.mp4

# Save structured output to JSON
python inference.py \
    --model_path /path/to/model \
    --video_path /path/to/video.mp4 \
    --output_json result.json
```

| Argument | Default | Description |
|---|---|---|
| `--model_path` | (required) | Path to the pretrained model checkpoint |
| `--video_path` | (required) | Path to the input video file |
| `--whisper_model_path` | None | Path to Whisper model for audio processing |
| `--max_new_tokens` | 8192 | Maximum number of tokens to generate |
| `--repetition_penalty` | 1.1 | Repetition penalty for generation |
| `--output_json` | None | Path to save structured JSON output |
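The command-line interface can also be driven from a script, e.g. to transcribe a folder of videos in one pass. A minimal sketch using `subprocess`; the model path and directories are placeholders, and `inference.py` is invoked exactly as in the commands above:

```python
import subprocess
from pathlib import Path

def build_cmd(model_path: str, video: Path, out_dir: Path) -> list[str]:
    """Compose the inference.py invocation for a single video."""
    return [
        "python", "inference.py",
        "--model_path", model_path,
        "--video_path", str(video),
        "--output_json", str(out_dir / f"{video.stem}.json"),
    ]

def run_batch(model_path: str, video_dir: str, out_dir: str) -> None:
    """Run inference on every .mp4 in video_dir, saving one JSON per video."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_cmd(model_path, video, out), check=True)

# Example (placeholder paths):
# run_batch("/path/to/model", "examples", "results")
```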
`Video --> <thinking> (plot + character relationships) --> Structured JSON`
```json
{
  "meta": {
    "title": "...",
    "duration": "00:05:00",
    "characters": ["Character A", "Character B"]
  },
  "script": [
    {
      "scene_id": 1,
      "location": "Mansion Living Room",
      "type": "Interior",
      "environment": "Luxuriously decorated villa...",
      "time": "Night",
      "mood": "Tense, Suspenseful",
      "events": [
        {
          "timestamp": "00:05",
          "character": "Character A",
          "action": "Walks in and looks around",
          "expression": "Alert",
          "dialogue": "Is anyone here?",
          "audio_cue": "Creaking door sound"
        }
      ]
    }
  ],
  "high_points": [
    {
      "type": "Emotional Reversal",
      "time_range": ["01:20", "01:35"],
      "description": "...",
      "reasoning": {
        "visual": "...",
        "audio": "...",
        "text": "...",
        "psychology": "..."
      },
      "score": 9.0
    }
  ]
}
```

```text
ARC-OmniScript/
|-- demo.py            # Flask web demo with local inference
|-- inference.py       # Command-line inference script
|-- setup_env.sh       # Environment setup script
|-- requirements.txt
|-- examples/          # Example videos for testing
|   |-- demo1.mp4
|   |-- demo2.mp4
|   |-- demo3.mp4
|   +-- demo4.mp4
+-- README.md
```
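A JSON file saved via `--output_json` can be consumed programmatically. A minimal sketch, assuming the output follows the schema shown above (the helper name is ours, not part of the release):

```python
import json

def extract_dialogue(script_json: dict) -> list[str]:
    """Flatten the scene/event hierarchy into '[timestamp] Character: line' strings."""
    lines = []
    for scene in script_json.get("script", []):
        for event in scene.get("events", []):
            if event.get("dialogue"):
                lines.append(
                    f'[{event["timestamp"]}] {event["character"]}: {event["dialogue"]}'
                )
    return lines

# Typical use with a saved result:
# with open("result.json", encoding="utf-8") as f:
#     print("\n".join(extract_dialogue(json.load(f))))
```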
If you find this project helpful, please star our repo and cite our technical report:
```bibtex
@article{pu2026omniscript,
  title={OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video},
  author={Pu, Junfu and Chen, Yuxin and Wang, Teng and Shan, Ying},
  journal={arXiv preprint arXiv:2604.11102},
  year={2026}
}
```


