Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents Paper • 2509.09265 • Published Sep 11 • 46
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning Paper • 2509.02479 • Published Sep 2 • 83
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs Paper • 2410.18451 • Published Oct 24, 2024 • 20