
Evaluation Results

This feature is a work in progress.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark’s leaderboard.

Benchmark Datasets

Dataset repos can be defined as Benchmarks (e.g., AIME, HLE, GPQA). These repos display a “Benchmark” tag, automatically aggregate evaluation results from model repos across the Hub, and show a leaderboard of top models.

Registering a Benchmark

To register your dataset as a benchmark:

  1. Create a dataset repo containing your evaluation data
  2. Add an eval.yaml file to the repo root with your benchmark configuration
  3. The file is validated at push time
  4. (Beta) Get in touch so we can add it to the allow-list.

The eval.yaml format is based on Inspect AI, enabling reproducible evaluations. See the Evaluating models with Inspect guide for details on running evaluations.
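
If you prefer to push the configuration from a script, the eval.yaml file can be uploaded with huggingface_hub. Below is a minimal sketch, assuming a hypothetical your-org/your-benchmark dataset repo and a local eval.yaml file:

from huggingface_hub import HfApi

api = HfApi()  # authenticates with your cached Hugging Face token
api.upload_file(
    path_or_fileobj="eval.yaml",        # local benchmark configuration
    path_in_repo="eval.yaml",           # must sit at the repo root
    repo_id="your-org/your-benchmark",  # hypothetical dataset repo ID
    repo_type="dataset",
    commit_message="Add benchmark configuration",
)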

Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the .eval_results/ folder. These results:

  • Appear on the model page with links to the benchmark leaderboard
  • Are aggregated into the benchmark dataset’s leaderboards
  • Can be submitted via PRs and marked as “community-provided”
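
To inspect a model's stored results programmatically, you can list and download the files under .eval_results/. Below is a minimal sketch with huggingface_hub and PyYAML, assuming a hypothetical your-org/your-model repo:

import yaml
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
repo_id = "your-org/your-model"  # hypothetical model repo ID
result_files = [f for f in api.list_repo_files(repo_id) if f.startswith(".eval_results/")]
for filename in result_files:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(local_path) as fh:
        for result in yaml.safe_load(fh):
            print(result["dataset"]["id"], result["value"])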

Adding Evaluation Results

To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the .eval_results/ folder.

Create a YAML file matching .eval_results/*.yaml in your model repo:

- dataset:
    id: cais/hle                  # Required. Hub dataset ID (must be a Benchmark)
    task_id: default              # Optional, in case there are multiple tasks or leaderboards for this dataset.
    revision: <hash>              # Optional. Dataset revision hash
  value: 20.90                    # Required. Metric value
  verifyToken: <token>            # Optional. Cryptographic proof of auditable evaluation
  date: 2025-01-15T10:30:00Z      # Optional. ISO-8601 datetime (defaults to git commit time)
  source:                         # Optional. Attribution for the result
    url: https://huggingface.co/datasets/cais/hle  # Required if source provided
    name: CAIS HLE                # Optional. Display name
    user: cais                    # Optional. HF username/org

Or, with only the required attributes:

- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
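
Such a file can also be generated from an evaluation script. Below is a minimal sketch with PyYAML, reusing the GPQA example above (the file name results.yaml is an arbitrary choice; any name matching .eval_results/*.yaml works):

import os
import yaml

results = [
    {
        "dataset": {"id": "Idavidrein/gpqa", "task_id": "gpqa_diamond"},
        "value": 0.412,
    }
]
os.makedirs(".eval_results", exist_ok=True)
with open(".eval_results/results.yaml", "w") as fh:
    yaml.safe_dump(results, fh, sort_keys=False)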

Results display badges based on their metadata in the YAML file:

  • verified: a valid verifyToken is present (the evaluation ran in HF Jobs with inspect-ai)
  • community-provided: the result was submitted via an open PR (not merged to main)
  • leaderboard: links to the benchmark dataset
  • source: links to evaluation logs or an external source

For more details on how to format this data, check out the Eval Results specifications.

Community Contributions

Anyone can submit evaluation results to any model via Pull Request:

  1. Go to the model page, open the “Community” tab, and create a Pull Request.
  2. Add a .eval_results/*.yaml file with your results.
  3. While the PR is open, the result shows as “community-provided” on the model page.
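
Alternatively, the Pull Request can be opened from a script by passing create_pr=True when uploading the file. Below is a minimal sketch, assuming a hypothetical some-org/some-model repo and a local results file:

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="results.yaml",             # your local eval results file
    path_in_repo=".eval_results/results.yaml",  # hypothetical file name inside the repo
    repo_id="some-org/some-model",              # hypothetical model repo ID
    repo_type="model",
    create_pr=True,                             # open a Pull Request instead of committing to main
    commit_message="Add evaluation results",
)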

For help evaluating a model, see the Evaluating models with Inspect guide.
