VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications Paper • 2509.26490 • Published Sep 30 • 19
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper • 2509.09677 • Published Sep 11 • 34
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Paper • 2509.15221 • Published Sep 18 • 109