Who Routes LLM Routers? RouterArena: Building the Evaluation Foundation for LLM Routing
Yifan Lu*, Rixin Liu*, Jiayi Yuan*, Xingqi Cui,
Shenrun Zhang, Hongyi Liu, Jiarong Xing
*Equal contribution · Rice University
📄 arXiv: paper | 🔗 GitHub: RouterArena | 🤗 Hugging Face: Dataset
The Diversifying Landscape of LLMs
Figure from paper: Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
For years, our community has pursued the goal of building a single, general-purpose foundation model capable of handling all questions and tasks, and this effort has achieved remarkable success. As scaling laws kicked in, these models have rapidly expanded to trillions of parameters and now surpass human performance on a wide range of benchmarks.
However, it is becoming increasingly clear that this scaling trend may not be sustainable. We are hitting the data wall, where high-quality training data is running out, as highlighted by Ilya Sutskever, OpenAI’s co-founder, at NeurIPS 2024. Future scaling will depend on generating new, high-quality data, which often requires costly human labeling and curation.
As a result, the LLM landscape is diversifying. In addition to chasing large general-purpose models, people are also exploring smaller, more efficient, and specialized ones. A good example is the Qwen family, which now includes nine categories, such as Qwen3-Coder, Qwen3-Image, and Qwen3-Guard. These models range from 0.6B to 480B parameters, with many specialized variants staying under 30B, handling relatively simple questions more efficiently.
This shift is further accelerated by startups and open-source initiatives embracing model specialization and customization. For instance, Thinking Machines Lab is building personalized AI systems, while rLLM provides an open framework for training domain-specific or user-tailored agents. Together, these efforts mark a clear transition from a “one-model-for-all” paradigm to a diverse ecosystem of LLMs, ranging from massive generalists to compact specialists.
The Key is Model Routers
As models continue to diversify (in both sizes and skills), a new challenge emerges: how do we choose the right model for the right task? If the era of large, general models was about “one model for everything,” the next era is about matching each query to the model that answers it best.
Therefore, automated query-to-model routing will become increasingly important. A helpful analogy is Google Search: when you type a query, the search engine scans billions of pages and routes you to the most relevant source. Similarly, as the model ecosystem expands, we’ll need intelligent routers that analyze an input and decide which model (or combination of models) can handle it most effectively.
The routing can happen at many levels: selecting between models of different sizes to balance cost and accuracy, choosing among specialized experts to get the highest-quality answers, or even orchestrating a workflow of multiple models that collaborate to complete a complex task.
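To make the simplest of these levels concrete, here is a minimal sketch of a rule-based router that trades off cost against capability between two models. The model names, prices, capability scores, and the difficulty heuristic are all hypothetical placeholders, not part of any real system.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_per_1k_tokens: float  # hypothetical price, USD
    strength: float            # rough capability score in [0, 1]

# Hypothetical model pool: one compact specialist, one large generalist.
POOL = [
    Candidate("small-specialist", cost_per_1k_tokens=0.10, strength=0.60),
    Candidate("large-generalist", cost_per_1k_tokens=2.00, strength=0.95),
]

def estimate_difficulty(query: str) -> float:
    """Toy difficulty heuristic: longer or proof-style queries score higher.
    A learned router would replace this with a trained classifier."""
    score = min(len(query) / 500, 1.0)
    if any(tok in query.lower() for tok in ("prove", "derive", "optimize")):
        score = max(score, 0.8)
    return score

def route(query: str) -> Candidate:
    """Pick the cheapest model whose strength covers the estimated difficulty."""
    difficulty = estimate_difficulty(query)
    viable = [m for m in POOL if m.strength >= difficulty]
    if not viable:  # nothing qualifies: fall back to the strongest model
        return max(POOL, key=lambda m: m.strength)
    return min(viable, key=lambda m: m.cost_per_1k_tokens)

print(route("What is 2 + 2?").name)                                   # small-specialist
print(route("Prove that the sum of two even numbers is even.").name)  # large-generalist
```

Learned routers replace the hand-written heuristic with a classifier trained on query-outcome data, but the interface stays the same: a query goes in, a model choice comes out.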
This idea is no longer just theoretical. As shown in the following figure, we’re already seeing a wave of router systems emerging across academia and industry—some simple and rule-based, others adaptive, learned, and data-driven. The most notable example is GPT-5, which is said to incorporate an internal router that dynamically selects among different models or “experts” depending on the task.
Who Routes the Routers? RouterArena!
In the near future, designing good routers will be just as important as training good models. This means that, just as we evaluate and analyze models today, we will need a rigorous study of routers to understand their performance, efficiency, and decision behaviors.
However, evaluating routers is far more challenging than evaluating models. Router evaluation is inherently multi-dimensional: there isn’t a single metric that captures how well a router performs. People care about many different aspects, e.g., query-answer quality, cost efficiency, routing consistency, robustness, and more. Even for the same query, the optimal routing decision can change depending on the available model candidates, cost constraints, or deployment settings, making fair and consistent evaluation even harder.
Unfortunately, there is currently no open platform that allows the public to evaluate routers using a comprehensive dataset and standardized metrics for fair comparison. That’s why we built RouterArena!
What is RouterArena?
RouterArena is an open platform for rigorous and comprehensive router evaluation. It provides (1) a principled dataset with broad knowledge-domain coverage, (2) distinguishable difficulty levels within each domain, (3) an extensive list of evaluation metrics, and (4) an automated evaluation framework. On top of that, we host a public leaderboard, giving the community a place to compare, track, and improve routers over time.
Evaluation dataset
RouterArena introduces a carefully constructed evaluation dataset built around two core design principles: diverse domain coverage and clear difficulty separation. For broad coverage, we draw inspiration from the Dewey Decimal Classification (DDC) system used in libraries to organize the world’s knowledge. This ensures that the dataset spans a wide range of domains across science, humanities, and applied disciplines. To differentiate query complexity, we adopt Bloom’s taxonomy, grouping questions into three levels—easy, medium, and hard—so routers can be tested on their ability to balance accuracy and cost when selecting between smaller and larger models.
Following these principles, we curated data from 23 open-source datasets, applied LLM-based difficulty annotation, and ensured balanced distribution across all categories. After deduplication, our final dataset contains 8,400 queries across 9 domains and 44 categories, each represented at multiple difficulty levels. The following figure illustrates the detailed composition of the dataset. Note that the distribution of difficulty levels is skewed, but this reflects real-world query patterns, where easy and medium questions occur more frequently than hard ones.
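For readers who want to build against the dataset, the sketch below shows one plausible way to represent a single entry. The field names are illustrative guesses based on the description above, so check the Hugging Face dataset card for the actual schema.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ArenaQuery:
    """Illustrative record layout for one RouterArena query; the actual
    field names in the released dataset may differ."""
    query: str
    domain: str      # one of the 9 DDC-inspired top-level domains
    category: str    # one of the 44 fine-grained categories
    difficulty: Literal["easy", "medium", "hard"]  # Bloom's-taxonomy-based label
    answer: str      # ground truth used for accuracy scoring

example = ArenaQuery(
    query="What is the boiling point of water at sea level, in Celsius?",
    domain="Natural sciences",
    category="Chemistry",
    difficulty="easy",
    answer="100",
)
```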
Evaluation metrics
RouterArena evaluates routers across 5 key dimensions to capture their real-world effectiveness.
- Query-answer accuracy. This captures a router’s ability to direct queries to models that answer them correctly.
- Query-answer cost. This measures the monetary cost incurred by a router’s routing decisions.
- Routing optimality. This reflects a router’s ability to perform optimal routing, that is, selecting the cheapest model that still produces a correct response (one way to compute this is sketched after this list).
- Routing robustness. This evaluates the router’s robustness to noisy inputs.
- Routing latency. This measures the latency overhead introduced by routing. Because the router sits on the critical path of production systems, it must handle millions of queries per second with minimal overhead.
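As a rough illustration of the first three dimensions, the sketch below scores a batch of routing decisions. The log format and the exact notion of optimality here are simplifying assumptions on our part; the paper’s formal definitions may differ.

```python
from dataclasses import dataclass

@dataclass
class RoutedResult:
    """One routed query's outcome (a hypothetical log format)."""
    correct: bool              # did the routed model answer correctly?
    cost: float                # dollars spent on the routed model
    oracle_cost: float | None  # cost of the cheapest model that answers
                               # correctly; None if no candidate succeeds

def accuracy(results: list[RoutedResult]) -> float:
    """Query-answer accuracy: fraction of queries answered correctly."""
    return sum(r.correct for r in results) / len(results)

def total_cost(results: list[RoutedResult]) -> float:
    """Query-answer cost: total spend across all routing decisions."""
    return sum(r.cost for r in results)

def optimality(results: list[RoutedResult]) -> float:
    """Routing optimality: fraction of answerable queries where the router
    was both correct and as cheap as the oracle (cheapest-correct) choice."""
    answerable = [r for r in results if r.oracle_cost is not None]
    hits = sum(r.correct and r.cost <= r.oracle_cost for r in answerable)
    return hits / len(answerable)
```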
Leaderboard
Left: the current RouterArena leaderboard. Commercial routers are closed-source, for-profit products; academic routers are open-source routers from academic papers. Right: each router's trade-off curve between accuracy and inference cost.
Building on these metrics, we maintain a live leaderboard where the community can compare routers across different dimensions of performance (check it out here). The figure above shows our overall “Arena Score,” a composite metric that captures the critical accuracy-cost trade-off. It summarizes how well a router balances correctness and efficiency, and is calculated using the following formula:

$$\text{Arena Score} = \lambda \cdot A + (1 - \lambda) \cdot (1 - \hat{C})$$

where $A$ represents the router’s accuracy, we set $\lambda = 0.5$ for a balanced weight between accuracy and cost, and the normalized cost $\hat{C}$ is given by:

$$\hat{C} = \frac{C - C_{\min}}{C_{\max} - C_{\min}}$$

with $C$ the router’s average per-query cost, and $C_{\max}$ and $C_{\min}$ representing the costs of the most expensive model (OpenAI/O3-Pro) and the least expensive model (GLM-4-Air), respectively.
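In code, the score reduces to a few lines. The helper below follows the formula as written above, and the numbers in the usage example are made up rather than taken from the leaderboard.

```python
def arena_score(acc: float, cost: float,
                c_min: float, c_max: float, lam: float = 0.5) -> float:
    """Arena Score: lambda-weighted blend of accuracy and cost efficiency.

    acc    -- router accuracy in [0, 1]
    cost   -- router's average per-query cost
    c_min  -- per-query cost of the cheapest candidate model (GLM-4-Air)
    c_max  -- per-query cost of the priciest candidate model (OpenAI/O3-Pro)
    lam    -- accuracy weight; 0.5 balances accuracy and cost
    """
    c_hat = (cost - c_min) / (c_max - c_min)  # normalized cost in [0, 1]
    return lam * acc + (1 - lam) * (1 - c_hat)

# Toy numbers: a router at 80% accuracy spending near the cheap end of the
# cost range scores ~0.89.
print(round(arena_score(acc=0.80, cost=0.002, c_min=0.001, c_max=0.050), 2))
```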
We have found that MIRT-BERT currently stands out as the most cost-effective router, achieving accuracy comparable to Azure-Router while operating at roughly one-fifth of the cost. In contrast, GPT-5 remains the performance leader but does so at significantly higher inference costs, highlighting the trade-off between routing accuracy and operational efficiency.
Ready to Show Off Your Routers?
We envision RouterArena as an open and community-driven project. The leaderboard will continue to evolve as new routers are developed, evaluated, and integrated. Our goal is to make router evaluation as open and effortless as possible, so anyone can test their router and see how it performs against existing ones.
If you would like your router to be on the board, please see instructions on our GitHub: https://github.com/RouteWorks/RouterArena.