# ROVER-Qwen3-8B: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
This model is the ROVER-Qwen3-8B checkpoint presented in the paper *Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards*.
ROVER (Random Policy Valuation for Diverse Reasoning) is a minimalist yet highly effective Reinforcement Learning (RL) method for Large Language Model (LLM) reasoning. It achieves both optimality and diversity by evaluating the Q-values of a uniformly random policy, bypassing complex policy-optimization frameworks such as PPO and GRPO. The approach is particularly effective for math reasoning tasks with verifiable rewards, delivering strong performance while requiring minimal GPU memory and computation.
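To make the core idea concrete, here is a minimal, hedged sketch (not the paper's implementation) of random-policy valuation on a toy tree-structured task with a binary verifiable reward: Monte Carlo estimates of the uniform policy's Q-values already identify the rewarded branch, and a softmax over those Q-values yields a diverse sampling policy. The toy setup and all names (`rollout_uniform`, `q_uniform`, `softmax_policy`) are illustrative assumptions.

```python
# Toy sketch of random-policy valuation on a depth-3 binary tree with a
# single rewarded leaf. Illustrative only -- not the paper's algorithm.
import math
import random

DEPTH, N_ACTIONS = 3, 2
CORRECT_LEAF = (1, 0, 1)  # the one leaf with verifiable reward 1


def reward(path):
    """Binary verifiable reward at the leaves."""
    return 1.0 if tuple(path) == CORRECT_LEAF else 0.0


def rollout_uniform(prefix):
    """Complete a trajectory from `prefix` with uniformly random actions."""
    path = list(prefix)
    while len(path) < DEPTH:
        path.append(random.randrange(N_ACTIONS))
    return reward(path)


def q_uniform(prefix, action, n_samples=2000):
    """Monte Carlo estimate of Q under the uniformly random policy."""
    total = sum(rollout_uniform(list(prefix) + [action]) for _ in range(n_samples))
    return total / n_samples


def softmax_policy(prefix, tau=0.1):
    """ROVER-style sampling: softmax over the uniform policy's Q-values."""
    qs = [q_uniform(prefix, a) for a in range(N_ACTIONS)]
    exps = [math.exp(q / tau) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]


# At the root, the action on the rewarded branch dominates while the
# softmax keeps some probability mass on alternatives (diversity):
print(softmax_policy(()))  # roughly [0.07, 0.93], favoring action 1
```

Even with no policy optimization, the uniform-policy Q-values alone separate the rewarded branch from the rest, which is the property ROVER exploits at LLM scale.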
For further details, including installation, training, and evaluation procedures, please refer to the official GitHub repository.
## Main Results and Features
*Figure 1: (a) ROVER achieves superior performance in terms of both pass@1 and pass@256 (trained on Qwen3-8B-Base, averaged over the AIME24, AIME25, and HMMT25 tasks). (b) An illustrative example showing that ROVER reaches high-quality solutions with a lightweight procedure (see the table below for details) while maintaining diversity. (c) ROVER achieves higher diversity.*
ROVER requires minimal GPU memory and computation, leaving more room for the KV cache. This allows ROVER to run on smaller-memory setups and speeds up training:
| Method | Memory Usage of Model Parameters |
|---|---|
| ROVER (Ours) | Low (actor model only) |
| GRPO | Medium (actor + reference model) |
| PPO | High (actor + reference + critic model) |
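A minimal usage sketch with Hugging Face `transformers` is shown below. The repo id is a hypothetical placeholder and the prompt and generation settings are illustrative assumptions; substitute the actual model id from this page.

```python
# Hedged usage sketch: load the checkpoint and generate a solution.
# "ROVER-Qwen3-8B" below is a placeholder repo id, not the confirmed path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ROVER-Qwen3-8B"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```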
## Citation
If you find the project useful, please consider citing our paper:
```bibtex
@article{he2025randompolicyvaluation,
  title={Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards},
  author={Haoran He and Yuxiao Ye and Qingpeng Cai and Chen Hu and Binxing Jiao and Daxin Jiang and Ling Pan},
  journal={arXiv preprint arXiv:2509.24981},
  year={2025}
}
```