VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation • arXiv:2510.14902 • Published Oct 2025
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators • arXiv:2510.00406 • Published Oct 2025
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model • arXiv:2509.09372 • Published Sep 11, 2025
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality • arXiv:2405.21060 • Published May 31, 2024
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models • arXiv:2404.09204 • Published Apr 14, 2024
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • Published Apr 12, 2024
BRAVE: Broadening the visual encoding of vision-language models • arXiv:2404.07204 • Published Apr 10, 2024
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention • arXiv:2404.07143 • Published Apr 10, 2024
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models • arXiv:2404.04478 • Published Apr 6, 2024
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • arXiv:2404.02258 • Published Apr 2, 2024
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction • arXiv:2404.02905 • Published Apr 3, 2024
GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot • arXiv:2403.13358 • Published Mar 20, 2024
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference • arXiv:2403.14520 • Published Mar 21, 2024