This model was trained as part of the Reinforcement Learning - 24 project at Peking University, focusing on direct preference optimization (DPO).
This model is a fine-tuned version of microsoft/Phi-3-mini-128k-instruct on the princeton-nlp/llama3-ultrafeedback dataset. Its results on the evaluation set are reported in the training results table below.
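A minimal usage sketch with the transformers chat-template API is shown below. The repository id is a placeholder for this model's actual id, and the generation settings are illustrative assumptions rather than values taken from this card.

```python
# Minimal inference sketch (assumptions: placeholder repo id, illustrative generation settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/phi3-mini-128k-dpo"  # placeholder; replace with this model's repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # the Phi-3 128k base has historically required custom code
)

messages = [{"role": "user", "content": "Explain direct preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```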
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.62 | 0.8549 | 400 | 0.6104 | -1.0659 | -1.4533 | 0.6976 | 0.3875 | -433.6641 | -403.0910 | 1.6821 | 1.6709 |
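In trl's DPO implementation, Rewards/chosen and Rewards/rejected are the β-scaled log-probability ratios between the policy and the reference model on the chosen and rejected responses, Rewards/margins is their difference, and Rewards/accuracies is the fraction of pairs where the chosen reward exceeds the rejected reward.

For reference, the sketch below shows how a DPO run of this kind could be set up with trl's `DPOTrainer` on the same base model and dataset. The hyperparameter values are illustrative placeholders rather than the values used for this model, and argument names differ slightly across trl releases.

```python
# DPO training sketch (assumptions: placeholder hyperparameters; trl API details vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Preference pairs: each example carries a prompt with a chosen and a rejected response.
train_dataset = load_dataset("princeton-nlp/llama3-ultrafeedback", split="train")

args = DPOConfig(
    output_dir="phi3-mini-128k-dpo",   # placeholder output path
    beta=0.01,                         # placeholder DPO temperature
    learning_rate=5e-7,                # placeholder
    per_device_train_batch_size=2,     # placeholder
    gradient_accumulation_steps=8,     # placeholder
    num_train_epochs=1,                # placeholder
    logging_steps=10,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                    # trl builds a frozen reference copy when None
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,               # called processing_class in newer trl releases
)
trainer.train()
```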