RynnVLA-002: A Unified Vision-Language-Action and World Model • arXiv:2511.17502 • Nov 2025 • 22 upvotes
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark • arXiv:2510.13759 • Oct 15, 2025 • 9 upvotes
Simulating the Visual World with Artificial Intelligence: A Roadmap • arXiv:2511.08585 • Nov 2025 • 29 upvotes
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation • arXiv:2510.26794 • Oct 2025 • 26 upvotes
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning • arXiv:2510.02240 • Oct 2, 2025 • 17 upvotes
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence • arXiv:2510.20579 • Oct 23, 2025 • 55 upvotes
VideoLucy: Deep Memory Backtracking for Long Video Understanding • arXiv:2510.12422 • Oct 14, 2025 • 1 upvote
VBench: Comprehensive Benchmark Suite for Video Generative Models • arXiv:2311.17982 • Nov 29, 2023 • 9 upvotes
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models • arXiv:2501.08453 • Jan 14, 2025 • 1 upvote
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models • arXiv:2506.21356 • Jun 26, 2025 • 22 upvotes
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation • arXiv:2508.15774 • Aug 21, 2025 • 20 upvotes
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation • arXiv:2510.05094 • Oct 6, 2025 • 37 upvotes
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs • arXiv:2411.15296 • Nov 22, 2024 • 21 upvotes
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos • arXiv:2501.13826 • Jan 23, 2025 • 25 upvotes
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training • arXiv:2509.23661 • Sep 28, 2025 • 44 upvotes