D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI Paper • 2510.05684 • Published Oct 7
Seeing Voices: Generating A-Roll Video from Audio with Mirage Paper • 2506.08279 • Published Jun 9
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features Paper • 2504.00557 • Published Apr 1