Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback • Paper • 2307.15217 • Published Jul 27, 2023
Explore, Establish, Exploit: Red Teaming Language Models from Scratch • Paper • 2306.09442 • Published Jun 15, 2023