---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-WorldPM-72B/blob/main/LICENSE
language:
- en
pipeline_tag: text-classification
library_name: transformers
tags:
- reward model
- preference model
base_model:
- Qwen/Qwen2.5-72B
---
# Qwen2.5-WorldPM-72B
## Introduction
Scaling laws have revealed that language modeling performance scales as a power law with model size and dataset size. We argue that similar laws exist in preference modeling. We propose **World** **P**reference **M**odeling (WorldPM) to emphasize this scaling potential. Our insight is that, despite the apparent diversity of human preferences, preference models can learn a unified preference representation through large-scale training.
To put this philosophical concept into practice, we conduct large-scale preference modeling experiments on models ranging from 1.5B to 72B parameters, using 15M preference samples from StackExchange. Our findings reveal that as the training data size increases, the test loss on well-defined preference modeling tasks decreases as a power law. The figure below demonstrates the loss trends across 12 test sets spanning subjective, objective, and adversarial domains.
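For concreteness, the power-law decrease referred to here is the usual scaling-law form; the symbols below are generic placeholders (the fitted constants are reported in the paper, not reproduced here):

$$
\mathcal{L}(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D > 0,
$$

where $D$ is the amount of preference training data and $D_c$, $\alpha_D$ are fitted constants.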

Specifically, our findings reveal:
1. **In adversarial evaluation, test losses demonstrate a power law decrease**, indicating the model's enhanced ability to identify responses with intentional errors and those that are well-written but irrelevant or incomplete. This finding suggests that subtle deficiencies commonly plaguing discriminative models can be effectively mitigated through large-scale training.
2. **The objective metrics reveal an emergent phenomenon**, where larger models demonstrate a power law decrease in test losses across more benchmarks. Specifically, only the 72B model shows consistent scaling benefits, with diminishing effects at 7B and no improvement at 1.5B. These results suggest that WorldPM represents a challenging task that requires larger models to elicit preferences for objective knowledge, pointing to its substantial potential for further advancement.
3. **Subjective evaluations show no apparent scaling trends.** We analyze potential reasons from the perspective of style preferences. While WorldPM becomes more style-neutral as it scales up, some subjective evaluations exhibit style preferences, resulting in lower evaluation performance.
For more details, please refer to our [paper](https://arxiv.org/abs/2505.10527).
### Usage Instructions
WorldPM aims to learn a unified representation of human preferences through large-scale preference training. Although experiments show that WorldPM can generalize to various human preference scenarios, for better application in specific scenarios, we recommend fine-tuning using human preference datasets. Experiments demonstrate that WorldPM serves as a good starting point for fine-tuning.
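As a rough illustration of what such fine-tuning can look like, below is a minimal sketch of pairwise (Bradley-Terry) training on top of WorldPM. This is an assumption-laden sketch, not the authors' recipe: `preference_pairs` is a hypothetical dataset of (chosen, rejected) conversations in the Quick Start message format, and in practice you would use gradient accumulation, parameter-efficient methods, or a framework such as TRL rather than this bare loop.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-WorldPM-72B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward(conversation):
    # Same scoring path as in Quick Start below, but keeping the graph for backprop.
    text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=False)
    return model(input_ids=input_ids.to(model.device))[0]

# Hypothetical placeholder: fill with your own (chosen, rejected) conversation pairs.
preference_pairs = []

for chosen, rejected in preference_pairs:
    # Bradley-Terry pairwise loss: push the chosen reward above the rejected one.
    loss = -F.logsigmoid(reward(chosen) - reward(rejected))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```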
<!-- ## Why Subjective Domains Don't Scale
In our scaling experiments for preference modeling, we observed clear scaling trends in objective domains but not in subjective ones. We attribute this to the multi-dimensional nature of subjective evaluations - the assessment results are essentially averages across many dimensions. This leads to positive scaling in some dimensions and negative scaling in others, resulting in an apparent lack of overall scaling. Notably, as explained in our paper, for certain surface-level dimensions like style, WorldPM overcomes these biases, leading to significantly lower evaluation scores.
## Why Preference Modeling is Scalable
The scalability of preference modeling might seem counterintuitive, with two main concerns: (1) From a task perspective, preference modeling appears too simple with only binary signals (indicating which response is preferred), resulting in sparse supervision. (2) From a data perspective, human forum data appears noisy and seemingly difficult to scale.
Addressing the first concern, consider why next token prediction successfully models language - to accurately predict the next word (e.g., with 90% probability), language models must understand comprehensive language rules. Similarly, to successfully predict 90% of preference dataset labels, models must learn sufficiently universal human preference representations (otherwise, such high accuracy would be impossible).
Regarding the second concern, noise refers to the apparent randomness in labels or supervision signals. However, since forum data represents genuine human annotations, it inherently contains its own rationality. While individual human intelligence might not discern the patterns behind these labels, this doesn't preclude powerful language models from discovering them.
Therefore, we can further infer that neural network scalability might depend neither on dense supervision signals nor on precise supervision signals, as evidenced by supervision signals from human preference data being both sparse and noisy. As long as the supervision signals are reasonable and challenging, scaling is possible - albeit dense and precise supervision signals would significantly expedite the convergence process. -->
## Requirements
- `transformers>=4.40.0` is required for the Qwen2.5-WorldPM-72B model. The latest version is recommended.
> [!WARNING]
> **🚨 This is a must because `transformers` has integrated Qwen2.5 code since `4.37.0`.**
For GPU memory requirements and the respective throughput, see the similar results for Qwen2 [here](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## Quick Start
### Prerequisites
- The system prompt is left empty during training, meaning no specific preferences are imposed on the model; the system marker itself is retained.
- The token `<|endoftext|>` is appended at the end of each conversation to mark the reward: its hidden state is mapped to compute the scalar reward value. You can print the rendered template to verify this, as shown below.
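To see exactly how a conversation is rendered, you can inspect the chat template output. This is just a sanity check; whether the `<|endoftext|>` reward marker appears in the rendered string depends on the chat template shipped with the tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-WorldPM-72B", trust_remote_code=True)

conversation = [
    {"role": "system", "content": ""},  # empty system prompt; the marker is kept
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]
# tokenize=False returns the raw string, so the system marker and any
# trailing reward token are visible for inspection.
print(tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False))
```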
### 🤗 Hugging Face Transformers
The following code snippet shows how to use Qwen2.5-WorldPM-72B with `transformers`:
```python
from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = 'Qwen/Qwen2.5-WorldPM-72B'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()

def get_score(model, tokenizer, conversation):
    # Render the conversation with the chat template; no generation prompt,
    # since we score a finished conversation rather than continue it.
    con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
    # The hidden state of the final `<|endoftext|>` token is mapped to a
    # scalar reward, returned as the first model output.
    outputs = model(input_ids=input_ids.to(model.device))
    score = outputs[0].cpu().item()
    return score
messages = [
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the list of fruits, sorted from most sugar to least sugar content:\n \n\n Fruit Sugar Content (per 100g)\n Bananas 16.1g\n Apples 10.4g\n Grapes 16.8g\n Kiwi 8.5g\n Mango 10.6g\n Oranges 6.9g\n Strawberries 5.6g\n Pineapples 8.0g\n Peaches 7.0g\n Watermelon 6.4g\n Note: The sugar content of fruits can vary depending on the variety, ripeness, and other factors. This table is based on average values and should be used as a general guide only.\n \n\n To sort the table from most to least sugar content, you can click on the header of the \"Sugar Content\" column to sort in ascending order. This will show the fruits with the highest sugar content at the top of the list, and the fruits with the lowest sugar content at the bottom of the list. You can also sort the table in descending order by clicking on the header of the \"Sugar Content\" column again.\n \n\n It's important to keep in mind that while sugar is an important part of a balanced diet, it's important to consume it in moderation and to consider the overall nutrition profile of the fruits you choose. In addition to sugar, fruits also provide important vitamins, minerals, and fiber that are essential for a healthy diet. When selecting fruits, it's a good idea to choose a variety of different types to get a range of nutrients and flavors."},
    ],
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the sugar content of the fruits in the list, sorted from more sugar to less:\n \n\n Fruit Sugar Content (g/100g)\n Grapes 15.4\n Bananas 14.4\n Mango 14.2\n Pineapples 11.8\n Apples 10.4\n Oranges 9.3\n Kiwi 9.3\n Strawberries 7.4\n Peaches 7.0\n Watermelon 6.2\n Note: The sugar content of fruits can vary depending on factors such as ripeness, variety, and growing conditions. The values in this table are approximate and may not be exact."},
    ]
]
scores = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]
print(scores)  # [0.14022722840309143, 1.9332066774368286]
```
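The raw scores are only meaningful relative to each other: the conversation with the higher reward is the preferred one. Under the Bradley-Terry interpretation commonly used for reward models (an assumption here, not something this card specifies), the reward gap maps to a preference probability:

```python
import math

# scores[0] and scores[1] come from the snippet above; the response with
# the higher reward is preferred. Under a Bradley-Terry reading, the reward
# gap maps to a preference probability via the logistic function.
p = 1.0 / (1.0 + math.exp(scores[0] - scores[1]))
print(f"P(second response preferred) = {p:.3f}")  # ~0.857 for the scores above
```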
## Citation
If you find our work helpful, feel free to give us a citation.
```bibtex
@article{worldpm,
  title={Modeling World Preference},
  author={Binghai Wang and Runji Lin and Keming Lu and Le Yu and Zhenru Zhang and Fei Huang and Chujie Zheng and Kai Dang and Yang Fan and Xingzhang Ren and An Yang and Dayiheng Liu and Tao Gui and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang and Bowen Yu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2505.10527},
  year={2025}
}
```