Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Abstract
Matrix-Game 3.0 enhances interactive video generation through memory-augmented diffusion models that achieve real-time 720p video synthesis with long-term temporal consistency.
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for real-time 720p long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
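The abstract's camera-aware memory retrieval can be illustrated with a minimal sketch: keep a bank of cached frame latents keyed by their camera poses, and when generating a new segment, retrieve the k cached latents whose poses are closest to the current camera pose so they can be injected as conditioning. The function name, the flattened-pose-vector distance, and the bank layout below are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def retrieve_memory(bank_poses: np.ndarray,
                    bank_latents: np.ndarray,
                    query_pose: np.ndarray,
                    k: int = 4):
    """Camera-aware retrieval sketch (hypothetical interface).

    bank_poses:   (N, D) flattened camera-pose vectors of cached frames
    bank_latents: (N, ...) latents cached alongside those poses
    query_pose:   (D,) pose of the frame being generated
    Returns the k latents whose poses are nearest to the query,
    plus their indices in the bank.
    """
    # Euclidean distance in pose space; a real system might use a
    # geodesic distance on SE(3) instead.
    dists = np.linalg.norm(bank_poses - query_pose, axis=1)
    idx = np.argsort(dists)[:k]
    return bank_latents[idx], idx

# Toy usage: four cached frames, query near the origin.
bank_poses = np.array([[0.0, 0, 0], [1.0, 0, 0], [5.0, 5, 5], [0.1, 0, 0]])
bank_latents = np.arange(4).reshape(4, 1)
latents, idx = retrieve_memory(bank_poses, bank_latents,
                               np.array([0.0, 0.0, 0.0]), k=2)
```

The retrieved latents would then be injected into the generator (e.g. via cross-attention) so that revisited viewpoints stay consistent with what was generated before.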
Community
Memory-augmented diffusion with camera-aware retrieval for minute-long coherence at 720p is a neat dream, but I'm curious how robust that stays when the camera motion is erratic or when viewpoint changes are rapid. Did you test that edge case, and if so, does the system degrade gracefully if you remove memory retrieval or blunt the cross-attention cues? The arxivlens breakdown helped me parse the method details, especially the multi-segment distillation and the error buffer, but I'd love to see a focused ablation on memory behavior under high-motion scenarios. A quick result there would really help judge deployment risk for real-world robotics and open-world sims.
Get this paper in your agent:
hf papers read 2604.08995