# Video-R1: Reinforcing Video Reasoning in MLLMs
This repository contains Video-R1/Qwen2.5-VL-7B-COT-SFT, the SFT (supervised fine-tuning) cold-start model trained on the Video-R1-COT-165k dataset. This intermediate checkpoint serves as the base model for the subsequent RL (reinforcement learning) stage on the Video-R1-260k dataset, which produces the final Video-R1 models.
For more details, please refer to the paper: Video-R1: Reinforcing Video Reasoning in MLLMs. The full code and additional resources are available on the GitHub repository.
## About Video-R1
Video-R1 represents the first systematic exploration of the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs), inspired by the success of DeepSeek-R1. The project addresses key challenges in video reasoning, particularly the lack of temporal modeling and the scarcity of high-quality video-reasoning data.
To tackle these issues, Video-R1 proposes the T-GRPO algorithm, an extension of GRPO that explicitly encourages models to leverage temporal information in videos for reasoning. It also strategically incorporates high-quality image-reasoning data into the training process. The model was trained on two newly constructed datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
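As a rough illustration of the temporal mechanism, the sketch below mimics T-GRPO's comparative reward: each question is rolled out once with frames in temporal order and once with the frames shuffled, and correct answers from the in-order group receive an extra bonus only when that group is correct more often than the shuffled one. The function name, the 0/1 reward convention, and the bonus value `alpha` are illustrative assumptions, not the released implementation.

```python
def temporal_bonus(ordered_rewards, shuffled_rewards, alpha=0.3):
    """Toy sketch of T-GRPO's temporal reward (assumed 0/1 accuracy rewards).

    ordered_rewards:  rewards for a group of rollouts that saw frames in order.
    shuffled_rewards: rewards for rollouts on the same question with shuffled frames.
    If the in-order group answers correctly more often, its correct rollouts get
    a bonus, so the policy is rewarded for actually using temporal information.
    """
    p_ordered = sum(r > 0 for r in ordered_rewards) / len(ordered_rewards)
    p_shuffled = sum(r > 0 for r in shuffled_rewards) / len(shuffled_rewards)
    if p_ordered <= p_shuffled:
        return ordered_rewards  # no evidence that temporal order helped
    return [r + alpha if r > 0 else r for r in ordered_rewards]

# Example: the in-order group succeeds more often, so its correct rollouts get the bonus
print(temporal_bonus([1, 0, 1, 1], [0, 0, 1, 0]))  # [1.3, 0, 1.3, 1.3]
```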
Experimental results show that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B reaches 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing proprietary models such as GPT-4o.
## Sample Usage
Below is a simple generation example for this SFT cold-start model, using the transformers library (a recent version with native Qwen2.5-VL support) and decord for video frame sampling.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
from decord import VideoReader, cpu
# Load model and processor
model_id = "Video-R1/Qwen2.5-VL-7B-COT-SFT"  # This specific SFT checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id)
# Sample a fixed number of frames uniformly across the video
def load_video_frames(video_path, num_frames=16):
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    indices = [int(i * (total_frames / num_frames)) for i in range(num_frames)]
    frames = vr.get_batch(indices).asnumpy()
    frames = [Image.fromarray(frame) for frame in frames]
    return frames
# Example usage
# Replace with your actual video path
# For demonstration, ensure a video file like 'examples/video1.mp4' exists or adjust the path
video_path = "./examples/video1.mp4"
frames = load_video_frames(video_path)

# Build a chat-formatted prompt that pairs the sampled frames with the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Prepare inputs; the list of PIL frames is passed as a single video
inputs = processor(text=[text], videos=[frames], return_tensors="pt").to("cuda")

# Generate a response and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```
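Note that this checkpoint is only the SFT cold start; the final Video-R1 models, obtained by further T-GRPO training on Video-R1-260k, are the ones evaluated in the paper. For the exact inference prompts used there (including the format that asks the model to reason inside `<think>` tags before answering), please refer to the GitHub repository.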
## Citation
If you find our work helpful for your research, please consider citing:
```bibtex
@article{feng2025video,
  title={Video-R1: Reinforcing Video Reasoning in MLLMs},
  author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2503.21776},
  year={2025}
}
```