---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-WorldPM-72B/blob/main/LICENSE
language:
  - en
pipeline_tag: text-classification
library_name: transformers
tags:
  - reward model
  - preference model
base_model:
  - Qwen/Qwen2.5-72B
---

# Qwen2.5-WorldPM-72B

## Introduction

Scaling laws have revealed that language modeling performance scales as a power law with model size and dataset size. We argue that similar laws exist in preference modeling, and we propose World Preference Modeling (WorldPM) to emphasize this scaling potential. Our insight is that, despite the apparent diversity of human preferences, preference models can learn a unified preference representation through large-scale training.

To put this concept into practice, we conduct large-scale preference modeling experiments using 15M preference samples from StackExchange, on models ranging from 1.5B to 72B parameters. Our findings reveal that as training data size increases, the test loss on well-defined tasks decreases as a power law. The figure below shows the loss trends across 12 test sets spanning subjective, objective, and adversarial domains.

*Figure: test-loss trends across 12 evaluation sets (subjective, objective, and adversarial domains).*

Specifically, our findings reveal:

  1. In adversarial evaluation, test losses demonstrate a power law decrease, indicating the model's enhanced ability to identify responses with intentional errors and those that are well-written but irrelevant or incomplete. This finding suggests that subtle deficiencies commonly plaguing discriminative models can be effectively mitigated through large-scale training.
  2. The objective metrics reveal an emergent phenomenon: larger models demonstrate a power-law decrease in test losses across more benchmarks. Specifically, only the 72B model shows consistent scaling benefits, with diminishing effects at 7B and no improvement at 1.5B. These results suggest that WorldPM is a challenging task that requires larger models to elicit preferences for objective knowledge, pointing to substantial potential for further advancement.
  3. Subjective evaluations show no apparent scaling trends. We analyze potential reasons from the perspective of style preferences. While WorldPM becomes more style-neutral as it scales up, some subjective evaluations exhibit style preferences, resulting in lower evaluation performance.
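As a side note on methodology, a power-law trend of the kind described above can be checked by fitting a straight line in log-log space. A minimal, self-contained sketch with synthetic numbers (none of these values come from the paper):

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares in log-log space,
    where log(loss) = log(a) - b * log(size) is a straight line."""
    lx = [math.log(x) for x in sizes]
    ly = [math.log(y) for y in losses]
    n = len(sizes)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) \
        / sum((u - mean_x) ** 2 for u in lx)
    b = -slope                        # power-law exponent
    a = math.exp(mean_y + b * mean_x)
    return a, b

# synthetic data that follows loss = 2 * size**-0.5 exactly
sizes = [1e5, 1e6, 1e7, 1e8]
losses = [2 * s ** -0.5 for s in sizes]
a, b = fit_power_law(sizes, losses)   # recovers a ≈ 2, b ≈ 0.5
```

On real measurements the fit is approximate, but a near-linear log-log plot with stable exponent is the usual signature of power-law scaling.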

For more details, please refer to our paper.

## Usage Instructions

WorldPM aims to learn a unified representation of human preferences through large-scale preference training. Although experiments show that WorldPM generalizes across diverse human preference scenarios, for better performance in a specific scenario we recommend fine-tuning on human preference data from that scenario. Experiments demonstrate that WorldPM serves as a strong starting point for such fine-tuning.

### Requirements

- `transformers>=4.40.0` is required for the Qwen2.5-WorldPM-72B model; the latest version is recommended.

🚨 This is a must because `transformers` has integrated Qwen2.5 code since version 4.37.0.

For requirements on GPU memory and the respective throughput, see the similar results for Qwen2 here.

## Quick Start

### Prerequisites

- The system prompt is left empty during training, meaning no specific preference is imposed on the model; the system marker itself is retained.
- The token `<|endoftext|>` is appended to the end of the conversation to mark the reward position; its hidden state is mapped to the scalar reward value.
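To make the second point concrete, here is a hypothetical NumPy sketch of mapping a hidden state to a scalar reward. The parameters `w` and `b` stand in for the model's learned head, and all shapes and names here are illustrative assumptions, not the model's actual internals:

```python
import numpy as np

def reward_from_hidden_states(hidden_states: np.ndarray,
                              w: np.ndarray, b: float) -> float:
    # hidden_states: (seq_len, hidden_size); the last row corresponds to the
    # appended <|endoftext|> token, whose state is mapped to the reward
    last_token_state = hidden_states[-1]
    return float(last_token_state @ w + b)

# toy example with a 2-dimensional "hidden size"
h = np.array([[1.0, 2.0],
              [3.0, 4.0]])
reward = reward_from_hidden_states(h, w=np.array([0.5, -0.5]), b=0.1)  # ≈ -0.4
```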

### 🤗 Hugging Face Transformers

Here is a code snippet showing how to use Qwen2.5-WorldPM-72B with `transformers`:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

model_name = 'Qwen/Qwen2.5-WorldPM-72B'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()


def get_score(model, tokenizer, conversation):
    # Render the conversation with the chat template, then map the hidden
    # state of the appended <|endoftext|> token to a scalar reward.
    con_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.encode(con_str, return_tensors="pt", add_special_tokens=False)
    outputs = model(input_ids=input_ids.to(model.device))
    return outputs[0].cpu().item()


messages = [
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the list of fruits, sorted from most sugar to least sugar content:\n \n\n Fruit Sugar Content (per 100g)\n Bananas 16.1g\n Apples 10.4g\n Grapes 16.8g\n Kiwi 8.5g\n Mango 10.6g\n Oranges 6.9g\n Strawberries 5.6g\n Pineapples 8.0g\n Peaches 7.0g\n Watermelon 6.4g\n Note: The sugar content of fruits can vary depending on the variety, ripeness, and other factors. This table is based on average values and should be used as a general guide only.\n \n\n To sort the table from most to least sugar content, you can click on the header of the \"Sugar Content\" column to sort in ascending order. This will show the fruits with the highest sugar content at the top of the list, and the fruits with the lowest sugar content at the bottom of the list. You can also sort the table in descending order by clicking on the header of the \"Sugar Content\" column again.\n \n\n It's important to keep in mind that while sugar is an important part of a balanced diet, it's important to consume it in moderation and to consider the overall nutrition profile of the fruits you choose. In addition to sugar, fruits also provide important vitamins, minerals, and fiber that are essential for a healthy diet. When selecting fruits, it's a good idea to choose a variety of different types to get a range of nutrients and flavors."},
    ],
    [
        {'role': 'system', 'content': ""},
        {'role': 'user', 'content': "Can you deduplicate this list: 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Apples\n 13. Mango\n 14. Watermelon\n 15. Peaches\n<extra_id_1>Assistant\nHere is the deduplicated list:\n \n\n 1. Apples\n 2. Bananas\n 3. Oranges\n 4. Strawberries\n 5. Kiwis\n 6. Bananas\n 7. Oranges\n 8. Grapes\n 9. Kiwis\n 10. Pineapples\n 11. Mango\n 12. Watermelon\n 13. Peaches\n<extra_id_1>User\nSort this list alphabetically\n<extra_id_1>Assistant\nHere is the alphabetized list:\n \n\n 1. Apples\n 2. Bananas\n 3. Grapes\n 4. Kiwi\n 5. Mango\n 6. Oranges\n 7. Peaches\n 8. Pineapples\n 9. Strawberries\n 10. Watermelon\n<extra_id_1>User\nCreate a table with these with a column with sugar content and sort from more sugar to less"},
        {'role': 'assistant', 'content': "Here is a table with the sugar content of the fruits in the list, sorted from more sugar to less:\n \n\n Fruit Sugar Content (g/100g)\n Grapes 15.4\n Bananas 14.4\n Mango 14.2\n Pineapples 11.8\n Apples 10.4\n Oranges 9.3\n Kiwi 9.3\n Strawberries 7.4\n Peaches 7.0\n Watermelon 6.2\n Note: The sugar content of fruits can vary depending on factors such as ripeness, variety, and growing conditions. The values in this table are approximate and may not be exact."},
    ],
]

scores = [get_score(model, tokenizer=tokenizer, conversation=message) for message in messages]
print(scores)  # [0.14022722840309143, 1.9332066774368286]
```
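The scalar scores are most meaningful in comparison. Under a Bradley-Terry reading (a standard interpretation of reward-model scores, assumed here rather than stated by this card), the difference between two scores maps to a preference probability:

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    # P(response A is preferred over response B) = sigmoid(score_a - score_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# with the two example scores above, the second (higher-scored) response
# is preferred with probability well above 0.5
p = preference_probability(1.9332066774368286, 0.14022722840309143)
```

Equal scores give a probability of exactly 0.5, and swapping the two arguments gives the complementary probability.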

## Citation

If you find our work helpful, feel free to give us a citation.


```bibtex
@article{prmlessons,
  title={Modeling World Preference},
  author={Binghai Wang and Runji Lin and Keming Lu and Le Yu and Zhenru Zhang and Fei Huang and Chujie Zheng and Kai Dang and Yang Fan and Xingzhang Ren and An Yang and Dayiheng Liu and Tao Gui and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang and Bowen Yu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2505.10527},
  year={2025}
}
```