Learning to Reason as Action Abstractions with Scalable Mid-Training RL • Paper • 2509.25810 • Published Sep 30, 2025 • 5
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model • Paper • 2509.00676 • Published Aug 31, 2025 • 83
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs • Paper • 2506.10128 • Published Jun 11, 2025 • 22
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning • Paper • 2506.05523 • Published Jun 5, 2025 • 34
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning • Paper • 2505.20561 • Published May 26, 2025 • 7
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs • Paper • 2410.08067 • Published Oct 10, 2024 • 2
Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency • Paper • 2309.17382 • Published Sep 29, 2023 • 5
Offline Reinforcement Learning for LLM Multi-Step Reasoning • Paper • 2412.16145 • Published Dec 20, 2024 • 38
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment • Paper • 2405.19332 • Published May 29, 2024 • 22