WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Abstract
WMPO, a pixel-based world-model framework for on-policy VLA RL, enhances sample efficiency, performance, self-correction, and generalization in robotic manipulation.
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interaction with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL that requires no interaction with the real environment. In contrast to widely used latent world models, WMPO makes pixel-based predictions that align the "imagined" trajectories with the VLA's visual features pretrained on web-scale images. Crucially, WMPO enables on-policy GRPO, which yields stronger performance than the commonly used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
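For intuition, the loop below sketches what one WMPO update might look like in pseudo-PyTorch: a group of trajectories is "imagined" entirely inside the pixel world model, scored, and used for a GRPO-style group-relative update. The interface names (`policy.sample`, `world_model.predict_next_frame`, `reward_fn`) and all hyperparameters are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of one WMPO optimization step, NOT the authors' code.
# All interfaces (policy.sample, world_model.predict_next_frame, reward_fn)
# and hyperparameters are hypothetical assumptions for exposition.
import torch

def wmpo_grpo_step(policy, world_model, reward_fn, init_frame, instruction,
                   group_size=8, horizon=32, clip_eps=0.2):
    """Roll out a group of trajectories inside the pixel world model,
    then apply a GRPO-style group-relative policy-gradient update."""
    group_returns, group_logps = [], []
    for _ in range(group_size):
        frame, logps = init_frame, []
        for _ in range(horizon):
            # The policy consumes raw pixels, so imagined frames stay aligned
            # with the VLA's web-scale visual pretraining.
            action, logp = policy.sample(frame, instruction)
            frame = world_model.predict_next_frame(frame, action)
            logps.append(logp)
        group_returns.append(reward_fn(frame, instruction))  # e.g. task success
        group_logps.append(torch.stack(logps))

    # GRPO: advantage is the group-normalized return (no learned critic).
    returns = torch.tensor(group_returns, dtype=torch.float32)
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = torch.tensor(0.0)
    for a, logp in zip(adv, group_logps):
        ratio = torch.exp(logp - logp.detach())  # equals 1 on-policy; carries grad
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        loss = loss - torch.min(ratio * a, clipped * a).mean()
    return loss / group_size
```

Because every rollout is generated by the current policy inside the world model, the update stays on-policy without any real-robot interaction, which is where the sample-efficiency gain comes from.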
Community
WMPO performs on-policy RL for vision-language-action policies inside a pixel-based world model, improving sample efficiency and generalization by aligning imagined trajectories with VLA features learned from web-scale images.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators (2025)
- World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training (2025)
- VLA-R1: Enhancing Reasoning in Vision-Language-Action Models (2025)
- Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations (2025)
- World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation (2025)
- SITCOM: Scaling Inference-Time COMpute for VLAs (2025)
- VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search (2025)