From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published 11 days ago • 64
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation Paper • 2510.08673 • Published 18 days ago • 117
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model Paper • 2508.13009 • Published Aug 18 • 25
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Paper • 2503.21979 • Published Mar 27 • 4
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency Paper • 2503.20785 • Published Mar 26 • 22