SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2 • 140
CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation Paper • 2505.21904 • Published May 28 • 3
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning Paper • 2505.24871 • Published May 30 • 23
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models Paper • 2505.24025 • Published May 29 • 27
Article A Dive into Pretraining Strategies for Vision-Language Models Feb 3, 2023 • 77