BigCodeBench Evaluator 🥇: Evaluate code samples using specified parameters
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines • Paper 2502.14739 • Published Feb 20, 2025