Ron Zhu
RzZ
AI & ML interests
None yet
Organizations
None yet
VLM
- UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
  Paper • 2312.15715 • Published • 21
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
  Paper • 2505.23747 • Published • 68
- VideoPrism: A Foundational Visual Encoder for Video Understanding
  Paper • 2402.13217 • Published • 38
- Scaling RL to Long Videos
  Paper • 2507.07966 • Published • 158
Robotic
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
  Paper • 2506.01943 • Published • 25
- LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
  Paper • 2506.00411 • Published • 31
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  Paper • 2506.01844 • Published • 142