MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 170
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR Paper • 2508.14029 • Published Aug 19 • 118
Pixels, Patterns, but No Poetry: To See The World like Humans Paper • 2507.16863 • Published Jul 21 • 68
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Paper • 2502.11962 • Published Feb 17 • 38
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning Paper • 2506.08989 • Published Jun 10 • 14
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2 • 52
Sherlock: Self-Correcting Reasoning in Vision-Language Models Paper • 2505.22651 • Published May 28 • 50
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26 • 12
Backdoor Cleaning without External Guidance in MLLM Fine-tuning Paper • 2505.16916 • Published May 22 • 16
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16 • 60
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published May 15 • 120
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper • 2504.13055 • Published Apr 17 • 19
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute Paper • 2503.23803 • Published Mar 31 • 8