Model description

As part of a homework assignment for the course "Modern NLP. Large Language Models" by vk.education, this model was fine-tuned with Proximal Policy Optimization (PPO).
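
The training code itself is not included in this card. Below is a minimal sketch of how a PPO fine-tune like this could be set up with the classic PPOTrainer interface from the trl library; the base checkpoint, prompts, hyperparameters and the reward_fn scorer are illustrative assumptions, not the actual training setup.

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder base checkpoint; the actual base model is not named in this card.
base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"

config = PPOConfig(model_name=base_model, learning_rate=1.41e-5, batch_size=8, mini_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained(base_model)
policy = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)      # policy + value head
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)  # frozen reference for the KL penalty

ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

prompts = ["What kind of sport do you like"] * config.batch_size  # toy prompt batch
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

# Sample responses from the current policy (generate returns prompt + completion, so slice the prompt off)
response_tensors = [ppo_trainer.generate(q, **gen_kwargs).squeeze(0)[len(q):] for q in query_tensors]

# reward_fn is a stand-in for whatever reward model scored the responses; it must return scalar tensors
rewards = [reward_fn(q, r) for q, r in zip(query_tensors, response_tensors)]

# One PPO step: clipped policy-gradient update with a KL penalty towards ref_policy
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)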

How to use:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained('georgebu/llm-course-hw2-ppo')
ppo_model = AutoModelForCausalLM.from_pretrained('georgebu/llm-course-hw2-ppo').to(device)

messages = [{"role": "user", "content": "What kind of sport do you like"}]

# Build the chat-formatted prompt and tokenize it
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Greedy decoding, up to 256 new tokens
generated_ids = ppo_model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Generation examples:

Prompt: "What is your morning routine like?"

Model response: Proximal Policy Optimization (PPO) is a subfield of Machine Learning that focuses on designing the policies that reduce the variance of a neural network's performance by minimizing the reconstruction loss. The goal of PPO is to identify the optimal policy that reduces the variance while encouraging the model to be more accurate and robust to noise.

In a traditional neural network, the model tries to achieve a good approximation from the training data. However, the model is often biased towards the lower-bounds of the approximation, leading to a high loss in the vicinity of the true values, even when the network estimates the true values as smooth. This is called the "inference problem."

PPO addresses this problem by learning a policy that is designed to take the true values of the data and infer the behavior of the network. The goal is to design a policy that assigns a high probability to the network's correct output, while minimizing the mispredictions. This is a relatively simple problem compared to the true-to-true ratio, and many popular optimization algorithms, such as gradient descent and Adam, are specifically designed to take the true values (called "global minimum" in the context of PPO) as a function of the network's parameters.

PPO provides a more robust and efficient implementation than traditional batch-distributed neural networks (BDN), which is a more complex and computationally expensive approach. In a BDN, the model is optimized for a specific task or environment, such as a regression task, and the network is trained to learn a suitable policy that minimizes the performance loss. In contrast, PPO designs the network to adaptively learn a more promising policy when a certain misclassification error occurs, rather than solely optimizing the final value of the network.

Known bugs:

Sometimes the model "loses its mind" and starts generating the same memorized text about Python functions. This appears to be a quirk it picked up during training. Unfortunately, at times it cannot answer a question at all and outputs only that memorized text.
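
A possible workaround, not verified for this checkpoint, is to switch from greedy decoding to sampling with a repetition penalty (continuing from the usage snippet above; the exact values are illustrative):

# Sampling with a repetition penalty tends to break up degenerate loops;
# the parameter values below are illustrative, not tuned for this model.
generated_ids = ppo_model.generate(
    **model_inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]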

Model details:

Model size: 0.1B parameters · Tensor type: F32 · Format: Safetensors