nielsr (HF Staff) committed
Commit 3aa4c0f · verified · 1 Parent(s): 6e19da7

Improve model card: Add pipeline tag, library name, paper, and GitHub links


This PR enhances the model card by:
- Adding `pipeline_tag: text-generation` for better discoverability.
- Specifying `library_name: transformers` so the Hub can show an automated usage code snippet.
- Including a link to the paper: [Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards](https://huggingface.co/papers/2509.24981).
- Adding a link to the official GitHub repository: https://github.com/tinnerhrhe/ROVER.

These updates make the model card more informative and user-friendly.
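
As a quick illustration of what the `pipeline_tag: text-generation` and `library_name: transformers` metadata unlock, a minimal usage snippet along these lines should work. This is a sketch only: the repo id below is a placeholder for this model's actual Hub id, and the prompt and generation settings are illustrative, not the authors' recommended configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/rover-qwen2.5-3b"  # placeholder: substitute this repository's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the checkpoint's native dtype
    device_map="auto",   # requires `accelerate`; places layers across available devices
)

prompt = "Solve: if 3x + 5 = 20, what is x? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```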

Files changed (1)
  1. README.md +38 -2
README.md CHANGED
@@ -1,5 +1,41 @@
 ---
-license: mit
 base_model:
 - Qwen/Qwen2.5-3B
----
+license: mit
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+# Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
+
+This repository contains the model presented in the paper [Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards](https://huggingface.co/papers/2509.24981).
+
+**ROVER** (Random Policy Valuation for Diverse Reasoning) is a minimalist yet highly effective reinforcement learning (RL) method for large language model (LLM) reasoning. It achieves superior optimality and diversity by evaluating uniform-policy Q-values, bypassing the complex policy iteration loops found in methods such as PPO and GRPO. The approach is particularly effective on math reasoning tasks, preserving diversity throughout training for sustained exploration of multiple valid solution paths.
+
+- **Paper**: [Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards](https://huggingface.co/papers/2509.24981)
+- **Code**: [https://github.com/tinnerhrhe/ROVER](https://github.com/tinnerhrhe/ROVER)
+
+## Main Results and Features
+<img width="1217" alt="ROVER teaser image" src="https://github.com/tinnerhrhe/ROVER/raw/main/teaser.png" />
+*Figure 1: (a) ROVER achieves superior performance in terms of both pass@1 and pass@256 (trained on Qwen3-8B-Base, averaged over the AIME24, AIME25, and HMMT25 tasks). (b) Illustrative example showing that ROVER reaches high-quality solutions with a lightweight procedure (see the table below for details) while maintaining diversity. (c) ROVER achieves higher diversity.*
+
+ROVER requires minimal GPU memory and compute, leaving more room for the KV cache. This lets ROVER run on smaller-memory setups and speeds up training:
+
+| Method       | Memory Usage of Model Parameters            |
+| ------------ | ------------------------------------------- |
+| ROVER (Ours) | **Low** (actor model only! 😊)              |
+| GRPO         | **Medium** (actor + reference model)        |
+| PPO          | **High** (actor + reference + critic model) |
+
+For installation, training, and evaluation instructions, please refer to the [GitHub repository](https://github.com/tinnerhrhe/ROVER).
+
+## Citation
+If you find the project useful, please consider citing our paper:
+```bibtex
+@article{he2025randompolicyvaluation,
+  title={Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards},
+  author={Haoran He and Yuxiao Ye and Qingpeng Cai and Chen Hu and Binxing Jiao and Daxin Jiang and Ling Pan},
+  journal={arXiv preprint arXiv:2509.24981},
+  year={2025}
+}
+```
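
The README's central claim (acting on the Q-values of a uniform random policy, with no policy improvement loop) can be illustrated on a toy deterministic tree MDP with binary verifiable rewards. The sketch below is illustrative only and is not the authors' training code: ROVER trains the LLM itself to estimate such Q-values, whereas this toy just runs exact backward induction and a greedy rollout; the tree and node names are made up.

```python
# Toy sketch of "random policy valuation is enough": in a deterministic tree
# MDP with 0/1 verifiable rewards at the leaves, acting greedily on the
# *uniform random* policy's Q-values already reaches a correct solution path.

tree = {                 # node -> available actions (children); hypothetical example
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}
reward = {"a1": 1, "a2": 0, "b1": 0}  # verifier outcome: 1 = correct final answer

def v_uniform(node):
    """V^unif(node): expected terminal reward when every remaining action
    is chosen uniformly at random (exact backward induction on the tree)."""
    if node not in tree:                        # leaf: verifiable 0/1 reward
        return reward[node]
    kids = tree[node]
    return sum(v_uniform(k) for k in kids) / len(kids)

def greedy_rollout(node):
    """Follow argmax_a Q^unif(s, a); in a deterministic tree,
    Q^unif(s, a) = V^unif(child reached by a)."""
    path = [node]
    while node in tree:
        node = max(tree[node], key=v_uniform)   # greedy w.r.t. uniform-policy Q
        path.append(node)
    return path, reward[node]

path, r = greedy_rollout("root")
print(path, "-> reward", r)   # ['root', 'a', 'a1'] -> reward 1
```

With binary rewards, Q^unif(s, a) is positive exactly when some correct leaf is reachable through action a, so the greedy choice never discards a solvable branch; that is the property that lets random policy valuation replace the usual policy iteration.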