# OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

[GitHub Repository] • [Paper PDF] • [Project Page]

## Introduction
OmniVTG is a multimodal model for Video Temporal Grounding (VTG): accurately localizing the segments of an untrimmed video that match a natural-language query.
Extending VTG to open-world applications has historically been challenging due to the limited scale and semantic diversity of existing datasets. To address this, we introduce the OmniVTG Dataset (featuring over 2,000 hours of rich, diverse video content) and a novel Self-Correction Chain-of-Thought (CoT) training paradigm. This combination unleashes the grounding capabilities of Multimodal Large Language Models (MLLMs).
This repository contains the official model weights for OmniVTG-7B, accepted at CVPR 2026.
## ✨ Highlights
- Open-World Readiness: Powered by a large-scale dataset featuring over 2,000 hours of video content with rich semantic diversity.
- Strong Zero-Shot Performance: Achieves robust zero-shot localization performance across four major VTG benchmarks (Charades-STA, ActivityNet Captions, QVHighlights, and TVGBench).
- Novel Training Paradigm: Trained via an advanced pipeline consisting of Supervised Fine-Tuning (SFT), Self-Correction CoT Tuning, and Reinforcement Learning (RL).
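The benchmarks above are typically scored with Recall@1 at temporal IoU thresholds (e.g. 0.5, 0.7). As a self-contained illustration of that metric (not code from this repository), temporal IoU between a predicted and a ground-truth segment can be computed as:

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

A prediction counts as correct at threshold t when `temporal_iou(pred, gt) >= t`.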
## Quick Start
To use OmniVTG-7B, please refer to our official codebase for full installation and inference instructions.
- Clone the repository and install dependencies:

  ```shell
  git clone https://github.com/oceanflowlab/OmniVTG
  cd OmniVTG
  ```

- Download this checkpoint and launch the interactive demo:

  ```shell
  python demo.py --model /path/to/OmniVTG-7B
  ```
For complete details on evaluation, the evaluation datasets, and the full training pipeline (SFT, CoT, RL), please visit our GitHub Repository.
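MLLM-based grounders generally return timestamps inside free-form text rather than as structured output. As a hedged sketch (the actual output format is defined by the OmniVTG codebase, not here), a minimal parser for a hypothetical "`<start> to <end>` seconds" response might look like:

```python
import re

def parse_segment(text):
    """Extract the first '<start> - <end>' / '<start> to <end>' second pair
    from free-form model output.
    NOTE: the response format is an assumption; check the official codebase."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    # Normalize so the segment always runs forward in time.
    return (start, end) if start <= end else (end, start)
```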
## Citation
If you find our work or model helpful for your research, please consider citing our paper:
```bibtex
@inproceedings{zheng2026omnivtg,
  title={OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding},
  author={Zheng, Minghang and Yin, Zihao and Yang, Yi and Peng, Yuxin and Liu, Yang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```