Model Card for trainer_output
This model is a fine-tuned version of HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset dataset. It has been trained using TRL.
Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="liuhailin0123/trainer_output", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
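Since this checkpoint is a reward model (see Training procedure below), it can also be loaded to score a single response directly. The following is a minimal sketch, assuming the checkpoint was saved with a sequence-classification (single-score) head; the example text is taken from the Examples section further down:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "liuhailin0123/trainer_output"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Score a candidate response; higher scores mean the reward model prefers it.
text = "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0].item()
print(score)
```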
Training procedure
We trained a reward model based on HuggingFaceTB/SmolLM-135M-Instruct on the Human-Like-DPO-Dataset, so that it assigns a positive score to the chosen response and a negative score to the rejected response, which is necessary for training a policy model in the PPO stage.
This model was trained with TRL's reward modeling trainer (RewardTrainer).
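For reference, the reward-modeling stage roughly follows TRL's standard RewardTrainer recipe. The sketch below is illustrative rather than the exact training script: the hyperparameters, the output directory, and the prompt/response concatenation step are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Reward model = base LM with a single-score classification head.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

# Fold the prompt into each response so rows use the implicit-prompt preference
# format ("chosen" / "rejected" text pairs) expected by RewardTrainer.
def to_implicit(example):
    return {
        "chosen": example["prompt"] + "\n" + example["chosen"],
        "rejected": example["prompt"] + "\n" + example["rejected"],
    }

dataset = dataset.map(to_implicit, remove_columns=["prompt"])

training_args = RewardConfig(output_dir="trainer_output", per_device_train_batch_size=8)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```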
Framework versions
- TRL: 0.16.0
- Transformers: 4.50.1
- Pytorch: 2.8.0.dev20250325+cu128
- Datasets: 3.3.2
- Tokenizers: 0.21.1
Examples
A positive example:
😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣
Our model rated this response with a score of 1.3196, meaning it prefers this response.
A negative example:
I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?
Our model rated this response with a score of -1.6590, meaning it does not prefer this response.
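For context, TRL's reward modeling optimizes a pairwise (Bradley-Terry style) objective, so the gap between the two scores above can be read as a preference probability. A quick, illustrative check:

```python
import math

chosen_score, rejected_score = 1.3196, -1.6590
# Probability of preferring the chosen response over the rejected one
# under a Bradley-Terry model of the pairwise scores.
prob_chosen = 1 / (1 + math.exp(-(chosen_score - rejected_score)))
print(f"{prob_chosen:.3f}")  # ~0.952
```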
Summary
As these examples show, the reward model was trained successfully and assigns scores that separate the chosen (human-like) responses from the rejected ones in the current dataset.