Blog, Articles, and Discussions

Community Articles
leaderboard, evaluation, nlp

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

20
April 8, 2025
math-verify, open-llm-leaderboard, leaderboard

Fixing Open LLM Leaderboard with Math-Verify

30
February 14, 2025
nlp, research, leaderboard

The Open Arabic LLM Leaderboard 2

36
February 10, 2025
open-llm-leaderboard, leaderboard, energy_efficiency

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

21
January 9, 2025
leaderboard, research, collaboration

Evaluating Audio Reasoning with Big Bench Audio

26
December 20, 2024
leaderboard, evaluation, nlp

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

37
December 4, 2024
community, research, nlp

Letting Large Models Debate: The First Multilingual LLM Debate Competition

33
November 20, 2024
community, research, nlp

Introduction to the Open Leaderboard for Japanese LLMs

39
November 20, 2024
leaderboard, arena, collaboration

Judge Arena: Benchmarking LLMs as Evaluators

58
November 19, 2024
leaderboard, collaboration, community

Introducing the Open FinLLM Leaderboard

79
October 4, 2024
nlp, research, leaderboard

🇨🇿 BenCzechMark - Can your LLM Understand Czech?

23
October 1, 2024
ai4math, nlp, community

How NuminaMath Won the 1st AIMO Progress Prize

122
July 11, 2024
agents, smolagents, nlp

Our Transformers Code Agent beats the GAIA benchmark!

98
July 1, 2024
leaderboard, research, collaboration

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

52
June 18, 2024