Improve model card: Add metadata, paper details, and related checkpoints

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +123 -3
README.md CHANGED
@@ -1,8 +1,128 @@
  ---
  license: mit
  ---
- ### Majority-Voting: Qwen3-8B-Base trained on DAPO-14k
-
- This is the Qwen3-8B-Base model trained by Majority-Voting using DAPO-14k training set.
-
- If you are interested in Co-rewarding, you can find more details on our Github Repo [https://github.com/tmlr-group/Co-rewarding].
  ---
  license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - reasoning
  ---
 
+ # Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
+
+ This is the **Majority-Voting: Qwen3-8B-Base** model, trained on the DAPO-14k dataset. It is part of the research presented in the paper [Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models](https://huggingface.co/papers/2508.00410).
+
+ Co-rewarding is a novel self-supervised reinforcement learning (RL) framework designed to improve training stability and elicit reasoning abilities in Large Language Models (LLMs) by leveraging complementary supervision signals. This checkpoint is a baseline model trained with the Majority-Voting method described in the paper.
+
+ For a comprehensive understanding of the Co-rewarding framework, training procedures, and other related models, please refer to the official GitHub repository: [https://github.com/tmlr-group/Co-rewarding](https://github.com/tmlr-group/Co-rewarding).
+
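+ ## Usage
+
+ A minimal text-generation sketch with 🤗 Transformers is given below. The repo id is this checkpoint's entry from the tables further down; the prompt, dtype, and decoding settings are illustrative defaults, not settings from the paper.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Repo id of this checkpoint (see the DAPO-14k table below).
+ model_id = "TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # illustrative; choose a dtype your hardware supports
+     device_map="auto",
+ )
+
+ # Base-model style prompting (no chat template), as this is a Qwen3-8B-Base checkpoint.
+ prompt = "A train travels 60 km in 45 minutes. What is its average speed in km/h? Let's think step by step."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```
+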
+ ## Abstract
+
+ While reinforcement learning with verifiable rewards (RLVR) is effective in improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels leads to a scaling-up dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter the non-negligible training collapse issue, as the single-view supervision signal easily forms a self-consistent illusion, yielding reward hacking. Inspired by the success of self-supervised learning, we propose *Co-rewarding*, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) *Co-rewarding-I* is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) *Co-rewarding-II* is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy to increase the difficulty of training collapse on trivial reasoning solutions. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by +3.31% on average on multiple mathematical reasoning benchmarks, especially by +7.49% on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) labels in several cases, such as a Pass@1 of 94.01% on GSM8K with Qwen3-8B-Base, remarkably higher than GT. Our code is publicly available at [https://github.com/tmlr-group/Co-rewarding](https://github.com/tmlr-group/Co-rewarding).
+
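+ For intuition about the training signal used for this particular checkpoint: the Majority-Voting baseline rewards rollouts whose final answer agrees with the most frequent answer among all rollouts sampled for the same question. The snippet below is a simplified, self-contained sketch of that idea only; it is not the authors' implementation, and answer extraction and the RL plumbing are omitted.
+
+ ```python
+ from collections import Counter
+
+ def majority_vote_rewards(answers):
+     """Reward each rollout 1.0 if its extracted final answer matches the
+     majority answer across all rollouts for the same question, else 0.0.
+     Illustrative sketch only; not the training code from the paper."""
+     counts = Counter(a.strip() for a in answers)
+     majority_answer, _ = counts.most_common(1)[0]
+     return [1.0 if a.strip() == majority_answer else 0.0 for a in answers]
+
+ # Example: 8 sampled rollouts for one question.
+ print(majority_vote_rewards(["72", "72", "68", "72", "70", "72", "72", "72"]))
+ # -> [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
+ ```
+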
+ ## Checkpoints
+
+ We release all checkpoints trained by us, including Co-rewarding-I, Co-rewarding-II, and all baselines.
+
+ #### Trained on MATH
+ | Model Name | Model Size | Method | Hugging Face Link |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH | 3B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen2.5-7B-MATH | 7B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-1.7B-Base-MATH | 1.7B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-MATH | 4B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH | 8B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-MATH | 3B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-MATH) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen2.5-3B-MATH | 3B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen2.5-7B-MATH | 7B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-1.7B-Base-MATH | 1.7B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-MATH | 4B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-MATH | 8B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-MATH | 3B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-MATH) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/GT-Qwen2.5-3B-MATH | 3B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/GT-Qwen2.5-7B-MATH | 7B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/GT-Qwen3-1.7B-Base-MATH | 1.7B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/GT-Qwen3-4B-Base-MATH | 4B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/GT-Qwen3-8B-Base-MATH | 8B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-MATH | 3B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-MATH) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Self-Certainty-Qwen2.5-3B-MATH | 3B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/Self-Certainty-Qwen2.5-7B-MATH | 7B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-1.7B-Base-MATH | 1.7B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-MATH | 4B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-MATH | 8B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-MATH | 3B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-MATH) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Entropy-Qwen2.5-3B-MATH | 3B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/Entropy-Qwen2.5-7B-MATH | 7B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/Entropy-Qwen3-1.7B-Base-MATH | 1.7B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/Entropy-Qwen3-4B-Base-MATH | 4B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/Entropy-Qwen3-8B-Base-MATH | 8B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-MATH | 3B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-MATH) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Majority-Voting-Qwen2.5-3B-MATH | 3B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen2.5-3B-MATH) |
+ | TMLR-Group-HF/Majority-Voting-Qwen2.5-7B-MATH | 7B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen2.5-7B-MATH) |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-1.7B-Base-MATH | 1.7B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-1.7B-Base-MATH) |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-MATH | 4B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-MATH) |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-MATH | 8B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-MATH) |
+ | TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-MATH | 3B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-MATH) |
+
+
+ #### Trained on DAPO-14k
+ | Model Name | Model Size | Method | Hugging Face Link |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-DAPO14k | 4B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k | 8B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-DAPO14k | 3B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-DAPO14k) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-DAPO14k | 4B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-DAPO14k | 8B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-DAPO14k | 3B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-DAPO14k) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/GT-Qwen3-4B-Base-DAPO14k | 4B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k | 8B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-DAPO14k | 3B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-DAPO14k) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-DAPO14k | 4B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-DAPO14k | 8B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-DAPO14k | 3B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-DAPO14k) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Entropy-Qwen3-4B-Base-DAPO14k | 4B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/Entropy-Qwen3-8B-Base-DAPO14k | 8B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-DAPO14k | 3B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-DAPO14k) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-DAPO14k | 4B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-DAPO14k) |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k | 8B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k) |
+ | TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-DAPO14k | 3B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-DAPO14k) |
+
+
+ #### Trained on OpenRS
+ | Model Name | Model Size | Method | Hugging Face Link |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-OpenRS | 4B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-OpenRS | 8B | Co-rewarding-I | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-OpenRS) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-OpenRS | 4B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-OpenRS | 8B | Co-rewarding-II | [View Model](https://huggingface.co/TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-OpenRS) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/GT-Qwen3-4B-Base-OpenRS | 4B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/GT-Qwen3-8B-Base-OpenRS | 8B | GT-GRPO | [View Model](https://huggingface.co/TMLR-Group-HF/GT-Qwen3-8B-Base-OpenRS) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-OpenRS | 4B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-OpenRS | 8B | Self-Certainty Maximization | [View Model](https://huggingface.co/TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-OpenRS) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Entropy-Qwen3-4B-Base-OpenRS | 4B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/Entropy-Qwen3-8B-Base-OpenRS | 8B | Entropy Minimization | [View Model](https://huggingface.co/TMLR-Group-HF/Entropy-Qwen3-8B-Base-OpenRS) |
+ | --- | --- | --- | --- |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-OpenRS | 4B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-OpenRS) |
+ | TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-OpenRS | 8B | Majority-Voting | [View Model](https://huggingface.co/TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-OpenRS) |
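+
+ Any of the repositories listed above can also be fetched programmatically. A small sketch with `huggingface_hub`; the repo id is just one example entry from the tables:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download one of the checkpoints listed above into the local Hugging Face cache.
+ local_path = snapshot_download(repo_id="TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH")
+ print(local_path)
+ ```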
+
+
+ ## Citation
+
+ If you use our datasets or models, please cite our paper!
+
+ ```bibtex
+ @article{zhang2025co,
+   title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
+   author={Zhang, Zizhuo and Zhu, Jianing and Ge, Xinmu and Zhao, Zihua and Zhou, Zhanke and Li, Xuan and Feng, Xiao and Yao, Jiangchao and Han, Bo},
+   journal={arXiv preprint arXiv:2508.00410},
+   year={2025}
+ }
+ ```