Cascade Reward Sampling for Efficient Decoding-Time Alignment Paper • 2406.16306 • Published Jun 24, 2024
DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning Paper • 2510.02341 • Published Sep 27, 2025
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment Paper • 2504.02193 • Published Apr 3, 2025