Skip to content

TencentARC/OmniScript

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu*, Yuxin Chen*, Teng Wang*, Ying Shan
ARC Lab, Tencent
*Equal Contribution

arXiv   Demo   Static Badge   Static Badge Blog   Example

Note: Our code and model weights are currently undergoing internal open-source review. They will be publicly released once the review is complete.

News

Introduction

OmniScript is an 8B-parameter omni-modal (audio + visual) language model built on Qwen3VL-8B, designed for the Video-to-Script (V2S) task: converting long-form cinematic videos into hierarchical, scene-by-scene structured screenplays.

Given a video (up to 5 minutes natively, or longer via two-stage generation), OmniScript produces:

  • Meta - title,, character list
  • Scene-level script - location, environment, time, mood for each scene
  • Event-level script - timestamped character actions, dialogues, expressions, audio cues, and subtext
  • Chain-of-Thought reasoning - plot summary and character relationship analysis before structured output

The model is trained via a progressive pipeline: modality alignment (1M videos) → multimodal pretraining (2.4M videos) → CoT supervised fine-tuning (45K videos) → reinforcement learning with temporally segmented rewards (GRPO). Despite its 8B parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to Gemini 3-Pro on both temporal localization and multi-field semantic accuracy.

Key Results

Event-level (5-min videos)

Model Params Omni Char. Dia. Act. Exp. Aud. Overall tIoU@0.1
Proprietary Models
Gemini-3-flash - Yes 28.8 50.3 28.2 25.5 11.2 28.8 44.3
Gemini-3-pro - Yes 39.8 68.8 37.4 35.4 13.3 38.9 64.4
Gemini-2.5-flash - Yes 40.1 75.5 42.8 36.5 22.8 43.6 74.3
Gemini-2.5-pro - Yes 41.7 75.0 41.9 39.0 17.0 42.9 73.4
Seed-1.8 - No 40.9 54.4 35.1 29.6 12.4 34.5 50.7
Seed-2.0-pro - No 47.4 68.1 42.9 35.7 10.3 40.9 67.1
Open-source Models
Qwen3VL 8B No 30.4 49.6 26.9 25.3 6.6 27.7 47.6
Qwen3VL 32B No 37.1 57.1 31.3 28.7 7.2 32.3 52.5
Qwen3VL 235B/22B No 38.1 58.6 33.0 29.1 6.0 33.0 62.0
TimeChat-Captioner 8B Yes 6.9 6.9 6.6 9.7 8.2 7.7 16.1
MiniCPM-O-4.5 9B Yes 3.1 8.0 2.9 3.2 2.4 3.9 3.2
OmniScript (Ours) 8B Yes 39.2 72.2 33.7 31.9 11.6 37.7 69.3

Scene-level (5-min videos)

Model Params Omni Loc. Type Env. Time Mood Overall tIoU@0.1
Proprietary Models
Gemini-3-flash - Yes 54.6 59.8 42.7 54.9 50.4 52.5 70.3
Gemini-3-pro - Yes 58.8 63.1 46.9 61.6 54.8 57.0 75.3
Gemini-2.5-flash - Yes 52.8 57.1 45.7 56.1 50.3 52.3 69.6
Gemini-2.5-pro - Yes 56.6 62.4 50.8 60.1 54.6 56.9 74.1
Seed-1.8 - No 57.9 58.6 47.7 58.7 52.8 55.1 74.0
Seed-2.0-pro - No 57.7 62.2 49.2 62.7 54.3 57.2 75.5
Open-source Models
Qwen3-Omni 30B/3B Yes 18.4 26.0 14.6 23.6 22.4 21.0 29.6
Qwen3VL 8B No 41.3 49.7 31.8 39.8 41.7 40.9 60.6
Qwen3VL 32B No 50.4 58.7 42.7 55.4 47.9 51.0 71.1
Qwen3VL 235B/22B No 52.6 60.2 45.4 57.9 50.9 53.4 72.8
TimeChat-Captioner 8B Yes 19.9 29.5 17.3 28.8 30.5 25.2 46.6
MiniCPM-O-4.5 9B Yes 10.3 22.0 8.4 17.7 17.4 15.1 32.0
OmniScript (Ours) 8B Yes 54.0 58.4 41.9 58.1 49.5 52.4 74.6

With only 8B parameters, OmniScript outperforms all open-source models (including Qwen3VL-235B) and achieves performance comparable to proprietary models like Gemini-3-Pro on both event-level and scene-level metrics.

Online Demo & API

Online Demo

We provide an Online OmniScript Demo hosted on the ARC Lab website where you can upload a video and experience OmniScript directly - no local setup required.

How to find demo on ARC Lab Homepage:

ARC Lab -> AI Demo -> Register with Phone No. -> Multimodal Comprehension and Generation -> ARC-OmniScript

API Service

We provide model access via API service. A brief tutorial on how to use the API is as follow. For more details, please refer to the documentation.

Prior to using the OmniScript API, obtaining an access token (ARC-Token) is mandatory. Users who are not logged in are required to complete account verification first.

Steps to get your token:

  1. Log in: Visit ARC Website and log in with your mobile number.
  2. Retrieve Token: Once logged in, click the user icon in the top-right corner and select "View Token" from the dropdown menu to get your ARC-TOKEN.

Requirements

  • Python >= 3.10
  • CUDA-capable GPU (16 GB+ VRAM recommended)
  • Conda
source setup_env.sh

This will create a conda environment omniscript and install all dependencies (PyTorch, FFmpeg, Flash Attention, etc.).

Local Web Demo

A Flask-based interface for uploading videos and viewing structured screenplay results with interactive timestamp navigation.

python demo.py \
    --model_path /path/to/model \
    --whisper_model_path /path/to/whisper-large-v3 \
    --port 8080
Argument Default Description
--model_path (required) Path to the pretrained model checkpoint
--whisper_model_path None Path to Whisper model for audio processing
--port 8080 Port for the Flask server
--host 0.0.0.0 Listening address
--debug False Use built-in sample data, no GPU needed

If --whisper_model_path is not specified, openai/whisper-large-v3 will be downloaded automatically from HuggingFace.

To quickly preview the UI without a GPU (use examples/debug.mp4 as the test video):

python demo.py --debug --port 8080

demo

Command-Line Inference

python inference.py \
    --model_path /path/to/model \
    --whisper_model_path /path/to/whisper-large-v3 \
    --video_path /path/to/video.mp4

# Save structured output to JSON
python inference.py \
    --model_path /path/to/model \
    --video_path /path/to/video.mp4 \
    --output_json result.json
Argument Default Description
--model_path (required) Path to the pretrained model checkpoint
--video_path (required) Path to the input video file
--whisper_model_path None Path to Whisper model for audio processing
--max_new_tokens 8192 Maximum number of tokens to generate
--repetition_penalty 1.1 Repetition penalty for generation
--output_json None Path to save structured JSON output

Output Schema

Video --> <thinking> (plot + character relationships) --> Structured JSON

example

{
  "meta": {
    "title": "...",
    "duration": "00:05:00",
    "characters": ["Character A", "Character B"]
  },
  "script": [
    {
      "scene_id": 1,
      "location": "Mansion Living Room",
      "type": "Interior",
      "environment": "Luxuriously decorated villa...",
      "time": "Night",
      "mood": "Tense, Suspenseful",
      "events": [
        {
          "timestamp": "00:05",
          "character": "Character A",
          "action": "Walks in and looks around",
          "expression": "Alert",
          "dialogue": "Is anyone here?",
          "audio_cue": "Creaking door sound"
        }
      ]
    }
  ],
  "high_points": [
    {
      "type": "Emotional Reversal",
      "time_range": ["01:20", "01:35"],
      "description": "...",
      "reasoning": {
        "visual": "...",
        "audio": "...",
        "text": "...",
        "psychology": "..."
      },
      "score": 9.0
    }
  ]
}

File Structure

ARC-OmniScript/
|-- demo.py              # Flask web demo with local inference
|-- inference.py         # Command-line inference script
|-- setup_env.sh         # Environment setup script
|-- requirements.txt
|-- examples/            # Example videos for testing
|   |-- demo1.mp4
|   |-- demo2.mp4
|   |-- demo3.mp4
|   +-- demo4.mp4
+-- README.md

Citation

If you find this project helpful, please star our repo and cite our technical report:

@article{pu2026omniscript,
  title={OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video},
  author={Pu, Junfu and Chen, Yuxin and Wang, Teng and Shan, Ying},
  journal={arXiv preprint arXiv:2604.11102},
  year={2026}
}

About

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors