---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-WorldPM-72B/blob/main/LICENSE
language:
- en
pipeline_tag: text-classification
library_name: transformers
tags:
- reward model
- preference model
base_model:
- Qwen/Qwen2.5-72B
---
# Qwen2.5-WorldPM-72B

## Introduction

Scaling laws have revealed that language modeling performance scales as a power law with model size and dataset size. We argue that similar laws exist in preference modeling, and we propose **W**orld **P**reference **M**odeling (WorldPM) to emphasize this scaling potential. Our insight is that, despite the apparent diversity of human preferences, preference models can learn a unified preference representation through large-scale training.

To put this concept into practice, we conduct large-scale preference modeling experiments on models ranging from 1.5B to 72B parameters, using 15M preference pairs from StackExchange. Our findings reveal that as the training data size increases, the test loss on well-defined tasks decreases as a power law. The figure below shows the loss trends across 12 test sets spanning subjective, objective, and adversarial domains.

![main-loss](main-loss.png)
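The power-law trend above can be written in the standard scaling-law form (a generic sketch; the exponent and constants here are illustrative placeholders, not fitted values from the paper):

```latex
% Test loss as a function of preference-data size D:
% an irreducible floor L_inf plus a power-law term that
% decays as the dataset grows past a critical scale D_c.
L(D) = L_{\infty} + \left(\frac{D_c}{D}\right)^{\alpha}, \qquad \alpha > 0
```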


Specifically, our findings reveal:

1. **In adversarial evaluation, test losses demonstrate a power law decrease**, indicating the model's enhanced ability to identify responses with intentional errors and those that are well-written but irrelevant or incomplete. This finding suggests that subtle deficiencies commonly plaguing discriminative models can be effectively mitigated through large-scale training.
2. **The objective metrics reveal an emergent phenomenon**, where larger models demonstrate a power law decrease in test losses across more benchmarks. Specifically, only the 72B model shows consistent scaling benefits, with diminishing effects at 7B and no improvement at 1.5B. These results suggest that WorldPM is a challenging task that requires larger models to elicit preferences for objective knowledge, pointing to its substantial potential for further advancement.
3. **Subjective evaluations show no apparent scaling trends.** We analyze potential reasons from the perspective of style preferences. While WorldPM becomes more style-neutral as it scales up, some subjective evaluations exhibit style preferences, resulting in lower evaluation performance. 

For more details, please refer to our [paper](https://arxiv.org/abs/2505.10527).

### Usage Instructions
WorldPM aims to learn a unified representation of human preferences through large-scale preference training. Although experiments show that WorldPM generalizes across a range of human preference scenarios, we recommend fine-tuning on in-domain human preference data for best results in a specific application. Experiments demonstrate that WorldPM serves as a strong starting point for such fine-tuning.
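Fine-tuning a reward model of this kind is typically done with a pairwise (Bradley-Terry) objective: the model should assign a higher scalar reward to the chosen response than to the rejected one. Below is a minimal, framework-free sketch of that loss; the reward values are made up for illustration, and in practice the rewards come from the model and the loss is backpropagated through it.

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(sigmoid(x)) = -log(1 + exp(-x)), written stably for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A correctly ranked pair (hypothetical rewards) yields a small loss;
# an inverted ranking yields a large one.
print(pairwise_loss(1.9, 0.1))
print(pairwise_loss(0.1, 1.9))
```

Minimizing this loss pushes the reward margin between chosen and rejected responses apart, which is the standard recipe for adapting a pretrained preference model to a target domain.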

<!-- ## Why Subjective Domains Don't Scale

In our scaling experiments for preference modeling, we observed clear scaling trends in objective domains but not in subjective ones. We attribute this to the multi-dimensional nature of subjective evaluations - the assessment results are essentially averages across many dimensions. This leads to positive scaling in some dimensions and negative scaling in others, resulting in an apparent lack of overall scaling. Notably, as explained in our paper, for certain surface-level dimensions like style, WorldPM overcomes these biases, leading to significantly lower evaluation scores.

## Why Preference Modeling is Scalable

The scalability of preference modeling might seem counterintuitive, with two main concerns: (1) From a task perspective, preference modeling appears too simple with only binary signals (indicating which response is preferred), resulting in sparse supervision. (2) From a data perspective, human forum data appears noisy and seemingly difficult to scale.

Addressing the first concern, consider why next token prediction successfully models language - to accurately predict the next word (e.g., with 90% probability), language models must understand comprehensive language rules. Similarly, to successfully predict 90% of preference dataset labels, models must learn sufficiently universal human preference representations (otherwise, such high accuracy would be impossible).

Regarding the second concern, noise refers to the apparent randomness in labels or supervision signals. However, since forum data represents genuine human annotations, it inherently contains its own rationality. While individual human intelligence might not discern the patterns behind these labels, this doesn't preclude powerful language models from discovering them.

Therefore, we can further infer that neural network scalability might depend neither on dense supervision signals nor on precise supervision signals, as evidenced by supervision signals from human preference data being both sparse and noisy. As long as the supervision signals are reasonable and challenging, scaling is possible  - albeit dense and precise supervision signals would significantly expedite the convergence process. -->

## Requirements

- `transformers>=4.40.0` is required for the Qwen2.5-WorldPM-72B model. The latest version is recommended.

> [!Warning]
> **🚨 This is a must because `transformers` integrated Qwen2.5 code since `4.37.0`.**

For requirements on GPU memory and the respective throughput, see similar results of Qwen2 [here](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

## Quick Start

### Prerequisites

- The system prompt is empty during training, meaning no specific preferences are imposed on the model. However, the system marker is retained.
- The token `<|endoftext|>` is appended at the end of the input; its hidden state is mapped to the scalar reward value.
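To illustrate the second point, the reward is a single scalar read off from the final token's hidden state through a linear projection. The toy sketch below uses made-up dimensions and weights purely to show the mechanism; the real model learns this projection end-to-end at its full hidden size.

```python
# Toy illustration: map the final token's hidden state to a scalar reward.
final_hidden = [0.2, -0.5, 0.1, 0.9]  # pretend hidden state of <|endoftext|> (hypothetical, size 4)
value_head = [0.3, -0.2, 0.5, 0.1]    # pretend learned linear projection weights

# Dot product of hidden state and projection = scalar reward.
reward = sum(h * w for h, w in zip(final_hidden, value_head))
print(reward)
```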

### 🤗 Hugging Face Transformers

Here is a code snippet showing how to use Qwen2.5-WorldPM-72B with `transformers`:

```python
from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = 'Qwen/Qwen2.5-WorldPM-72B'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()


def get_score(model, tokenizer, conversation):
    # Serialize the conversation with the chat template (no generation prompt).
    con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
    # The model returns a single scalar reward for the conversation.
    outputs = model(input_ids=input_ids.to(model.device))
    score = outputs[0].cpu().item()
    return score


messages = [
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the list of fruits, sorted from most sugar to least sugar content:\n \n\n Fruit Sugar Content (per 100g)\n Bananas 16.1g\n Apples 10.4g\n Grapes 16.8g\n Kiwi 8.5g\n Mango 10.6g\n Oranges 6.9g\n Strawberries 5.6g\n Pineapples 8.0g\n Peaches 7.0g\n Watermelon 6.4g\n Note: The sugar content of fruits can vary depending on the variety, ripeness, and other factors. This table is based on average values and should be used as a general guide only.\n \n\n To sort the table from most to least sugar content, you can click on the header of the \"Sugar Content\" column to sort in ascending order. This will show the fruits with the highest sugar content at the top of the list, and the fruits with the lowest sugar content at the bottom of the list. You can also sort the table in descending order by clicking on the header of the \"Sugar Content\" column again.\n \n\n It's important to keep in mind that while sugar is an important part of a balanced diet, it's important to consume it in moderation and to consider the overall nutrition profile of the fruits you choose. In addition to sugar, fruits also provide important vitamins, minerals, and fiber that are essential for a healthy diet. When selecting fruits, it's a good idea to choose a variety of different types to get a range of nutrients and flavors."},
    ],
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the sugar content of the fruits in the list, sorted from more sugar to less:\n \n\n Fruit Sugar Content (g/100g)\n Grapes 15.4\n Bananas 14.4\n Mango 14.2\n Pineapples 11.8\n Apples 10.4\n Oranges 9.3\n Kiwi 9.3\n Strawberries 7.4\n Peaches 7.0\n Watermelon 6.2\n Note: The sugar content of fruits can vary depending on factors such as ripeness, variety, and growing conditions. The values in this table are approximate and may not be exact."},
    ]
]

scores = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]

print(scores)  # [0.14022722840309143, 1.9332066774368286]
```
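The two scalar rewards printed above can be compared directly, or converted into a preference probability under the Bradley-Terry model. Here is a small helper sketch; the input values are taken from the snippet's sample output, and the helper name is our own, not part of the model's API.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(response A preferred over response B) = sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Rewards from the sample output above: the second (higher-scored)
# conversation's answer is strongly preferred over the first.
p = preference_probability(1.9332066774368286, 0.14022722840309143)
print(round(p, 3))
```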

## Citation

If you find our work helpful, feel free to give us a citation.

```bibtex
@article{prmlessons,
  title={Modeling World Preference},
  author={Binghai Wang and Runji Lin and Keming Lu and Le Yu and Zhenru Zhang and Fei Huang and Chujie Zheng and Kai Dang and Yang Fan and Xingzhang Ren and An Yang and Dayiheng Liu and Tao Gui and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang and Bowen Yu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2505.10527},
  year={2025}
}
```