---
base_model: HuggingFaceTB/SmolLM-135M-Instruct
datasets: HumanLLMs/Human-Like-DPO-Dataset
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- reward-trainer
licence: license
license: mit
language:
- en
---

# Model Card for trainer_output

This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset.
It has been trained using [TRL](https://github.com/huggingface/trl).

## Quick start

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="liuhailin0123/trainer_output", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```

## Training procedure

We trained a reward model based on the HuggingFaceTB/SmolLM-135M-Instruct model on the Human-Like-DPO-Dataset, so that it assigns a positive score to the chosen response and a negative score to the rejected response. Such a reward model is necessary to train a policy model in the PPO stage.

This model was trained with TRL's `RewardTrainer`.

### Framework versions

- TRL: 0.16.0
- Transformers: 4.50.1
- Pytorch: 2.8.0.dev20250325+cu128
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Examples

A positive example:

> 😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣

Our model rated it with a score of 1.3196, which means it prefers this response.

A negative example:

> I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?

Our model rated it with a score of -1.6590, which means it does not prefer this response.

## Summary

As these examples show, the model was indeed trained and is able to assign rewards consistent with the preferences in the dataset. The sketches below illustrate how such a training run and such scores could be reproduced.
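
For reference, the training setup described above could look roughly like the following. This is a minimal sketch, not the actual training script (which is not included in this card): the hyperparameters, the pad-token handling, and the step that folds the prompt into each response via the chat template are all assumptions.

```python
# Minimal sketch of a RewardTrainer run on this dataset (assumed, not the original script).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Reward models in TRL are sequence-classification heads with a single scalar output.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# The dataset has "prompt", "chosen" and "rejected" string columns; here the prompt is
# folded into each response with the chat template so the model scores the full exchange.
def to_pair(example):
    def render(response):
        return tokenizer.apply_chat_template(
            [{"role": "user", "content": example["prompt"]},
             {"role": "assistant", "content": response}],
            tokenize=False,
        )
    return {"chosen": render(example["chosen"]), "rejected": render(example["rejected"])}

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")
dataset = dataset.map(to_pair, remove_columns=dataset.column_names)

# Illustrative hyperparameters only; the values actually used are not stated in this card.
args = RewardConfig(output_dir="trainer_output", per_device_train_batch_size=8, num_train_epochs=1)
trainer = RewardTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```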
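
Similarly, scores like the ones quoted in the Examples section can be recomputed with a small scoring helper. Again a sketch under assumptions: the prompt below is an illustrative placeholder (this card only shows the responses), and the input is rendered with the chat template, matching the assumption made in the training sketch above.

```python
# Minimal sketch for scoring a (prompt, response) pair with the trained reward model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "liuhailin0123/trainer_output"  # repo name taken from the Quick start snippet
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to a conversation."""
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# The chosen-style reply should score higher than the rejected-style reply.
# (The prompt is a made-up placeholder; the original prompts are not listed in this card.)
print(score("Have you seen that new meme going around?",
            "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Spill the beans! 🤣"))
print(score("Have you seen that new meme going around?",
            "I'm an artificial intelligence language model, I don't have personal experiences or opinions."))
```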