OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows Paper • 2510.24411 • Published 22 days ago • 70
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published 21 days ago • 44
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety Paper • 2510.08240 • Published Oct 9 • 40
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense Paper • 2510.07242 • Published Oct 8 • 30
RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization Paper • 2510.02172 • Published Oct 2 • 7
Jointly Reinforcing Diversity and Quality in Language Model Generations Paper • 2509.02534 • Published Sep 2 • 24
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond Paper • 2505.19641 • Published May 26 • 67
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26 • 104
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published May 19 • 45
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era Paper • 2503.12329 • Published Mar 16 • 27
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction Paper • 2502.07316 • Published Feb 11 • 50
UI Agent Collection a collection of algorithmic agents for user interfaces/interactions, program synthesis, and robotics • 425 items • Updated 6 days ago • 64
Teaching Language Models to Critique via Reinforcement Learning Paper • 2502.03492 • Published Feb 5 • 24
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published Dec 27, 2024 • 87
Diving into Self-Evolving Training for Multimodal Reasoning Paper • 2412.17451 • Published Dec 23, 2024 • 43
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Paper • 2411.17451 • Published Nov 26, 2024 • 11
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale Paper • 2409.17115 • Published Sep 25, 2024 • 63
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18, 2024 • 39