# Video-R1: Reinforcing Video Reasoning in MLLMs
This repository contains Video-R1/Qwen2.5-VL-7B-COT-SFT, the SFT (supervised fine-tuning) cold-start model trained on the Video-R1-COT-165k dataset. This intermediate checkpoint serves as the base model for the subsequent RL (reinforcement learning) stage on the Video-R1-260k dataset, which produces the final Video-R1 models.
For more details, please refer to the paper: Video-R1: Reinforcing Video Reasoning in MLLMs. The full code and additional resources are available on the GitHub repository.
## About Video-R1
Video-R1 represents the first systematic exploration of the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs), inspired by the success of DeepSeek-R1. The project addresses key challenges in video reasoning, particularly the lack of temporal modeling and the scarcity of high-quality video-reasoning data.
To tackle these issues, Video-R1 proposes the T-GRPO algorithm, an extension of GRPO that explicitly encourages models to leverage temporal information in videos for reasoning. It also strategically incorporates high-quality image-reasoning data into the training process. The model was trained on two newly constructed datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
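As a rough illustration of the temporal mechanism, the sketch below mimics T-GRPO's comparative reward: each question is rolled out once with frames in temporal order and once with the frames shuffled, and correct answers from the in-order group receive an extra bonus only when that group is correct more often than the shuffled one. The function name, the 0/1 reward convention, and the bonus value `alpha` are illustrative assumptions, not the released implementation.

```python
def temporal_bonus(ordered_rewards, shuffled_rewards, alpha=0.3):
    """Toy sketch of T-GRPO's temporal reward (assumed 0/1 accuracy rewards).

    ordered_rewards:  rewards for a group of rollouts that saw frames in order.
    shuffled_rewards: rewards for rollouts on the same question with shuffled frames.
    If the in-order group answers correctly more often, its correct rollouts get
    a bonus, so the policy is rewarded for actually using temporal information.
    """
    p_ordered = sum(r > 0 for r in ordered_rewards) / len(ordered_rewards)
    p_shuffled = sum(r > 0 for r in shuffled_rewards) / len(shuffled_rewards)
    if p_ordered <= p_shuffled:
        return ordered_rewards  # no evidence that temporal order helped
    return [r + alpha if r > 0 else r for r in ordered_rewards]

# Example: the in-order group succeeds more often, so its correct rollouts get the bonus
print(temporal_bonus([1, 0, 1, 1], [0, 0, 1, 0]))  # [1.3, 0, 1.3, 1.3]
```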
Experimental results show that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B reaches 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing proprietary models such as GPT-4o.
## Sample Usage
Below is a simple generation example for this SFT cold-start model, using the transformers library (a recent version with native Qwen2.5-VL support) and decord for video frame sampling.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
from decord import VideoReader, cpu
# Load model and processor
model_id = "Video-R1/Qwen2.5-VL-7B-COT-SFT"  # This specific SFT checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id)
# Sample a fixed number of frames uniformly across the video
def load_video_frames(video_path, num_frames=16):
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    indices = [int(i * (total_frames / num_frames)) for i in range(num_frames)]
    frames = vr.get_batch(indices).asnumpy()
    frames = [Image.fromarray(frame) for frame in frames]
    return frames
# Example usage
# Replace with your actual video path
# For demonstration, ensure a video file like 'examples/video1.mp4' exists or adjust the path
video_path = "./examples/video1.mp4"
frames = load_video_frames(video_path)

# Build a chat-formatted prompt that pairs the sampled frames with the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Prepare inputs; the list of PIL frames is passed as a single video
inputs = processor(text=[text], videos=[frames], return_tensors="pt").to("cuda")

# Generate a response and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```
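Note that this checkpoint is only the SFT cold start; the final Video-R1 models, obtained by further T-GRPO training on Video-R1-260k, are the ones evaluated in the paper. For the exact inference prompts used there (including the format that asks the model to reason inside `<think>` tags before answering), please refer to the GitHub repository.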
## Citation
If you find our work helpful for your research, please consider citing:
```bibtex
@article{feng2025video,
  title={Video-R1: Reinforcing Video Reasoning in MLLMs},
  author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2503.21776},
  year={2025}
}
```