---
base_model: HuggingFaceTB/SmolLM-135M-Instruct
datasets: HumanLLMs/Human-Like-DPO-Dataset
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- reward-trainer
licence: license
license: mit
language:
- en
---

# Model Card for trainer_output

This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset.
It has been trained using [TRL](https://github.com/huggingface/trl).

## Quick start

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="liuhailin0123/trainer_output", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```

## Training procedure

We trained a reward model based on the HuggingFaceTB/SmolLM-135M-Instruct model on the Human-Like-DPO-Dataset, so that it assigns a positive score to the chosen response and a negative score to the rejected response. Such a reward model is necessary to train a policy model in the PPO stage.

This model was trained with TRL's `RewardTrainer`.

### Framework versions

- TRL: 0.16.0
- Transformers: 4.50.1
- Pytorch: 2.8.0.dev20250325+cu128
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Examples

A positive example:

> 😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣

Our model rated it with a score of 1.3196, which means it prefers this response.

A negative example:

> I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?

Our model rated it with a score of -1.6590, which means it does not prefer this response.

## Summary

As these examples show, the model was indeed trained and is able to assign rewards consistent with the preferences in the dataset. The sketches below illustrate how such a training run and such scores could be reproduced.
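
For reference, the training setup described above could look roughly like the following. This is a minimal sketch, not the actual training script (which is not included in this card): the hyperparameters, the pad-token handling, and the step that folds the prompt into each response via the chat template are all assumptions.

```python
# Minimal sketch of a RewardTrainer run on this dataset (assumed, not the original script).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Reward models in TRL are sequence-classification heads with a single scalar output.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# The dataset has "prompt", "chosen" and "rejected" string columns; here the prompt is
# folded into each response with the chat template so the model scores the full exchange.
def to_pair(example):
    def render(response):
        return tokenizer.apply_chat_template(
            [{"role": "user", "content": example["prompt"]},
             {"role": "assistant", "content": response}],
            tokenize=False,
        )
    return {"chosen": render(example["chosen"]), "rejected": render(example["rejected"])}

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")
dataset = dataset.map(to_pair, remove_columns=dataset.column_names)

# Illustrative hyperparameters only; the values actually used are not stated in this card.
args = RewardConfig(output_dir="trainer_output", per_device_train_batch_size=8, num_train_epochs=1)
trainer = RewardTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```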
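
Similarly, scores like the ones quoted in the Examples section can be recomputed with a small scoring helper. Again a sketch under assumptions: the prompt below is an illustrative placeholder (this card only shows the responses), and the input is rendered with the chat template, matching the assumption made in the training sketch above.

```python
# Minimal sketch for scoring a (prompt, response) pair with the trained reward model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "liuhailin0123/trainer_output"  # repo name taken from the Quick start snippet
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to a conversation."""
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# The chosen-style reply should score higher than the rejected-style reply.
# (The prompt is a made-up placeholder; the original prompts are not listed in this card.)
print(score("Have you seen that new meme going around?",
            "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Spill the beans! 🤣"))
print(score("Have you seen that new meme going around?",
            "I'm an artificial intelligence language model, I don't have personal experiences or opinions."))
```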