FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
Abstract
FLOWER, a 950M-parameter VLA policy, achieves competitive performance at reduced computational cost through intermediate-modality fusion and action-specific Global-AdaLN conditioning.
Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of the LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into FLOWER, a novel 950M-parameter VLA. Pretrained in just 200 H100 GPU-hours, FLOWER delivers performance competitive with larger VLAs across 190 tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER sets a new state-of-the-art (SoTA) score of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
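The abstract names two architectural ideas: intermediate-modality fusion (feeding intermediate VLM hidden states to the action head while pruning the upper LLM layers) and action-specific Global-AdaLN conditioning (a single, globally shared AdaLN modulation network instead of one per transformer block). The paper and linked repository define the actual implementation; the sketch below only illustrates the Global-AdaLN idea under assumed details. All names (GlobalAdaLN, DiTBlock, FlowActionHead), dimensions, and the block layout are hypothetical and are not taken from the FLOWER codebase.

```python
import torch
import torch.nn as nn


class GlobalAdaLN(nn.Module):
    """One modulation MLP shared by every block (hypothetical sketch).

    Instead of each DiT-style block owning its own AdaLN network, a single
    global MLP maps the conditioning vector (e.g. flow timestep plus
    vision-language features) to scale/shift/gate terms reused by all blocks.
    """

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        # Produces 6 modulation vectors: shift/scale/gate for the attention
        # sub-layer and for the MLP sub-layer.
        self.net = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * hidden_dim))

    def forward(self, cond: torch.Tensor):
        return self.net(cond).chunk(6, dim=-1)


def modulate(x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Standard AdaLN modulation: scale and shift the normalized activations.
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DiTBlock(nn.Module):
    """Transformer block that consumes externally supplied modulation,
    so AdaLN parameters are not duplicated per block."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x, mod):
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = mod
        h = modulate(self.norm1(x), shift_a, scale_a)
        x = x + gate_a.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = modulate(self.norm2(x), shift_m, scale_m)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x


class FlowActionHead(nn.Module):
    """Toy flow-matching-style action head conditioned via Global-AdaLN."""

    def __init__(self, hidden_dim: int = 512, cond_dim: int = 512, depth: int = 6):
        super().__init__()
        self.global_adaln = GlobalAdaLN(cond_dim, hidden_dim)  # shared, not per-block
        self.blocks = nn.ModuleList(DiTBlock(hidden_dim) for _ in range(depth))

    def forward(self, noisy_actions: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        mod = self.global_adaln(cond)   # computed once per forward pass
        x = noisy_actions
        for block in self.blocks:
            x = block(x, mod)           # every block reuses the same modulation
        return x


# Usage: denoise a chunk of 10 action tokens (dim 512) for a batch of 2.
head = FlowActionHead()
out = head(torch.randn(2, 10, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Because the modulation MLP is instantiated once rather than once per block, its parameters are amortized over the whole flow head; this is the kind of saving the abstract attributes to Global-AdaLN conditioning, though the exact 20% figure depends on details specified in the paper.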
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- 3D FlowMatch Actor: Unified 3D Policy for Single- and Dual-Arm Manipulation (2025)
- Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation (2025)
- FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction (2025)
- CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing&Sparsification (2025)
- Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy (2025)
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance (2025)
