GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Paper • 2606.17861 • Published 2 days ago • 36
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Paper • 2606.17861 • Published 2 days ago • 36
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Paper • 2606.14832 • Published 6 days ago • 10
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Paper • 2606.14832 • Published 6 days ago • 10
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published May 8 • 1
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published May 8 • 1
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published Apr 30 • 42
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published Apr 30 • 42
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Paper • 2604.16029 • Published Apr 17 • 23
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Paper • 2604.16029 • Published Apr 17 • 23
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published Apr 13 • 67
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper • 2602.05400 • Published Feb 5 • 356