Model Card for trainer_output

This model is a fine-tuned version of HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset. It has been trained using TRL.

Quick start

This is a reward model, not a text-generation model, so it is loaded with a sequence-classification head and used to score a response rather than generate one (a minimal sketch; the example answer below is illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_id = "liuhailin0123/trainer_output"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda")
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
answer = "Ooh, tough call! I'd pick the future; I'm too curious about what happens next."  # illustrative response
ids = tokenizer.apply_chat_template([{"role": "user", "content": question}, {"role": "assistant", "content": answer}], return_tensors="pt").to("cuda")
with torch.no_grad():
    print(model(ids).logits[0].item())  # higher score = more preferred response

Training procedure

We trained a reward model based on HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset, so that it assigns a positive score to the chosen response and a negative score to the rejected response. Such a reward model is necessary for training a policy model in the PPO stage.
This model was trained with TRL's reward modeling trainer (RewardTrainer).
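
For reference, the sketch below shows how such a reward model can be trained with TRL's RewardTrainer. This is a minimal, illustrative sketch rather than the exact training script: the hyperparameters are placeholders, and it assumes the dataset exposes plain-text prompt, chosen, and rejected columns, which are folded into the implicit-prompt pairs RewardTrainer expects.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# A reward model is the base LM with a single-logit classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")
# Fold the explicit prompt into each completion so RewardTrainer scores full texts.
dataset = dataset.map(
    lambda x: {"chosen": x["prompt"] + "\n" + x["chosen"],
               "rejected": x["prompt"] + "\n" + x["rejected"]},
    remove_columns=["prompt"])

args = RewardConfig(output_dir="trainer_output", per_device_train_batch_size=8)  # placeholder hyperparameters
trainer = RewardTrainer(model=model, args=args, processing_class=tokenizer, train_dataset=dataset)
trainer.train()

RewardTrainer optimizes the pairwise Bradley-Terry objective, -log sigmoid(r_chosen - r_rejected), which pushes the chosen response's score above the rejected one's.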

Framework versions

  • TRL: 0.16.0
  • Transformers: 4.50.1
  • PyTorch: 2.8.0.dev20250325+cu128
  • Datasets: 3.3.2
  • Tokenizers: 0.21.1
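
To approximate this environment, the pinned versions above can be installed with pip. The PyTorch entry is a nightly build, so a close match comes from the PyTorch nightly index (adjust the CUDA tag for your machine):

pip install "trl==0.16.0" "transformers==4.50.1" "datasets==3.3.2" "tokenizers==0.21.1"
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128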

Examples

A positive example:

"😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣"

Our model rated this response 1.3196, meaning it prefers it.

A negative example:

"I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?"

Our model rated this response -1.6590, meaning it does not prefer it.
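
This kind of comparison can be reproduced by scoring several candidate responses to the same prompt and keeping the highest-scoring one. The sketch below reuses the loading code from the quick start; the prompt is a hypothetical stand-in, since the card does not list the original one.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "liuhailin0123/trainer_output"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

def reward(prompt, response):
    # Scalar reward for one (prompt, response) pair.
    ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}],
        return_tensors="pt")
    with torch.no_grad():
        return model(ids).logits[0].item()

prompt = "Have you seen that viral meme going around?"  # hypothetical prompt
candidates = [
    "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Spill the beans! 🤣",
    "I'm an artificial intelligence language model, I don't have personal experiences or opinions.",
]
for c in candidates:
    print(f"{reward(prompt, c):+.4f}  {c[:50]}")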

Summary

As these examples show, the trained model assigns clearly higher rewards to the human-like (chosen) responses than to the rejected ones, consistent with the preferences in the dataset.
