SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models Paper • 2511.05459 • Published 28 days ago • 2
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? Paper • 2509.03312 • Published Sep 3 • 5
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Paper • 2510.10689 • Published Oct 12 • 46
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization Paper • 2507.06181 • Published Jul 8 • 43