Model Card for trainer_output
This model is a fine-tuned version of HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset dataset. It has been trained using TRL.
Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="liuhailin0123/trainer_output", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
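Since this checkpoint is a reward model (see Training procedure below), it can also be loaded to score a single response directly. The following is a minimal sketch, assuming the checkpoint was saved with a sequence-classification (single-score) head; the example text is taken from the Examples section further down:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "liuhailin0123/trainer_output"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Score a candidate response; higher scores mean the reward model prefers it.
text = "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0].item()
print(score)
```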
Training procedure
We trained a reward model based on HuggingFaceTB/SmolLM-135M-Instruct on the Human-Like-DPO-Dataset, so that it assigns a positive score to the chosen response and a negative score to the rejected response, which is necessary for training a policy model in the PPO stage.
This model was trained with TRL's reward modeling trainer (RewardTrainer).
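For reference, the reward-modeling stage roughly follows TRL's standard RewardTrainer recipe. The sketch below is illustrative rather than the exact training script: the hyperparameters, the output directory, and the prompt/response concatenation step are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
# Reward model = base LM with a single-score classification head.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

# Fold the prompt into each response so rows use the implicit-prompt preference
# format ("chosen" / "rejected" text pairs) expected by RewardTrainer.
def to_implicit(example):
    return {
        "chosen": example["prompt"] + "\n" + example["chosen"],
        "rejected": example["prompt"] + "\n" + example["rejected"],
    }

dataset = dataset.map(to_implicit, remove_columns=["prompt"])

training_args = RewardConfig(output_dir="trainer_output", per_device_train_batch_size=8)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```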
Framework versions
- TRL: 0.16.0
- Transformers: 4.50.1
- Pytorch: 2.8.0.dev20250325+cu128
- Datasets: 3.3.2
- Tokenizers: 0.21.1
Examples
A positive example:
😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣
Our model rated this response with a score of 1.3196, meaning it prefers this response.
A negative example:
I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?
Our model rated this response with a score of -1.6590, meaning it does not prefer this response.
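For context, TRL's reward modeling optimizes a pairwise (Bradley-Terry style) objective, so the gap between the two scores above can be read as a preference probability. A quick, illustrative check:

```python
import math

chosen_score, rejected_score = 1.3196, -1.6590
# Probability of preferring the chosen response over the rejected one
# under a Bradley-Terry model of the pairwise scores.
prob_chosen = 1 / (1 + math.exp(-(chosen_score - rejected_score)))
print(f"{prob_chosen:.3f}")  # ~0.952
```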
Summary
As these examples show, the reward model was trained successfully and assigns scores that separate the chosen (human-like) responses from the rejected ones in the current dataset.