T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
Abstract
Token- and Turn-level Policy Optimization (T²PO) addresses multi-turn RL instability by controlling exploration at the token and turn levels through uncertainty monitoring and dynamic turn resampling.
Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T²PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T²PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T²PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T²PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and task performance, along with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
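To make the two control signals concrete, here is a minimal sketch of the token-level intervention trigger and the turn-level resampling check described in the abstract. This is an illustration only, not the paper's implementation: the entropy-based uncertainty proxy, the threshold values, and the function names (`token_entropy`, `should_intervene`, `should_resample_turn`) are all assumptions; see the linked repository for the actual method.

```python
import math
from typing import List

# Hypothetical thresholds; the paper's actual values and its exact
# uncertainty measure are not specified in the abstract.
TOKEN_DELTA_THRESHOLD = 1e-3    # marginal uncertainty change per token
TURN_PROGRESS_THRESHOLD = 0.05  # minimum per-turn uncertainty reduction

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy of a next-token distribution (uncertainty proxy)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_intervene(entropy_trace: List[float]) -> bool:
    """Token level: trigger a thinking intervention once the marginal
    uncertainty change falls below a threshold (exploration has stalled)."""
    if len(entropy_trace) < 2:
        return False
    return abs(entropy_trace[-1] - entropy_trace[-2]) < TOKEN_DELTA_THRESHOLD

def should_resample_turn(entropy_at_turn_start: float,
                         entropy_at_turn_end: float) -> bool:
    """Turn level: flag a turn whose uncertainty reduction is negligible,
    so the rollout can be resampled rather than wasted."""
    return (entropy_at_turn_start - entropy_at_turn_end) < TURN_PROGRESS_THRESHOLD

if __name__ == "__main__":
    trace = [2.31, 1.74, 1.7401]  # entropies after each generated token
    print(should_intervene(trace))            # True: marginal change ~1e-4
    print(should_resample_turn(2.31, 2.30))   # True: negligible turn progress
```

In this reading, the token-level check gates a "think more" intervention during generation, while the turn-level check decides whether an entire environment interaction is worth keeping in the rollout buffer.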
Community
We are excited to share T²PO, an uncertainty-guided exploration control method for stable multi-turn agentic reinforcement learning. T²PO improves exploration at both token and turn levels, leading to more stable and sample-efficient training for LLM agents.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning (2026)
- DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents (2026)
- GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning (2026)
- FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (2026)
- SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training (2026)
- Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR (2026)
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data (2026)