MetaCLIP-2-Cifar10

MetaCLIP-2-Cifar10 is an image classification model fine-tuned from the vision-language encoder facebook/metaclip-2-worldwide-s16 for single-label classification. It assigns each input image to one of the ten CIFAR-10 object classes using the MetaClip2ForImageClassification architecture.

MetaCLIP 2: A Worldwide Scaling Recipe: https://huggingface.co/papers/2507.22062

Classification report:

              precision    recall  f1-score   support

    airplane     0.9813    0.9685    0.9748      2000
  automobile     0.9777    0.9850    0.9813      2000
        bird     0.9560    0.9560    0.9560      2000
         cat     0.9104    0.9395    0.9247      2000
        deer     0.9566    0.9580    0.9573      2000
         dog     0.9476    0.9215    0.9343      2000
        frog     0.9774    0.9735    0.9755      2000
       horse     0.9704    0.9670    0.9687      2000
        ship     0.9782    0.9890    0.9836      2000
       truck     0.9774    0.9735    0.9755      2000

    accuracy                         0.9631     20000
   macro avg     0.9633    0.9632    0.9632     20000
weighted avg     0.9633    0.9631    0.9632     20000
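
As a quick consistency check on the report above, the macro-average row is the unweighted mean of the per-class rows. Because every class has the same support (2000), the weighted average should match it up to rounding. The sketch below recomputes those means from the rounded table values. The names `report` and `macro_average` are illustrative, and a difference in the last digit is expected because the inputs are already rounded to four decimals.

```python
# Per-class (precision, recall, f1) values copied from the classification report.
report = {
    "airplane":   (0.9813, 0.9685, 0.9748),
    "automobile": (0.9777, 0.9850, 0.9813),
    "bird":       (0.9560, 0.9560, 0.9560),
    "cat":        (0.9104, 0.9395, 0.9247),
    "deer":       (0.9566, 0.9580, 0.9573),
    "dog":        (0.9476, 0.9215, 0.9343),
    "frog":       (0.9774, 0.9735, 0.9755),
    "horse":      (0.9704, 0.9670, 0.9687),
    "ship":       (0.9782, 0.9890, 0.9836),
    "truck":      (0.9774, 0.9735, 0.9755),
}

def macro_average(rows):
    """Unweighted mean of each metric column across all classes."""
    n = len(rows)
    columns = zip(*rows.values())  # one tuple per metric column
    return tuple(sum(col) / n for col in columns)

macro_precision, macro_recall, macro_f1 = macro_average(report)
print(f"macro avg  {macro_precision:.4f}  {macro_recall:.4f}  {macro_f1:.4f}")
```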

The model classifies images into the following categories:

Class 0: airplane
Class 1: automobile
Class 2: bird
Class 3: cat
Class 4: deer
Class 5: dog
Class 6: frog
Class 7: horse
Class 8: ship
Class 9: truck
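
The class indices above correspond to positions in the model's output logits. As a minimal sketch of that mapping, the snippet below applies a softmax to a made-up logit vector and looks up the winning index. `ID2LABEL`, `softmax`, `predict_label`, and the example logits are illustrative and not taken from the model. In Transformers, image classification checkpoints usually expose this mapping as `model.config.id2label`; whether this checkpoint populates it with the class names is an assumption worth checking.

```python
import math

# Index-to-label mapping as listed in this model card.
ID2LABEL = {
    0: "airplane", 1: "automobile", 2: "bird", 3: "cat", 4: "deer",
    5: "dog", 6: "frog", 7: "horse", 8: "ship", 9: "truck",
}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits for one image; index 3 ("cat") is the largest.
example_logits = [0.1, -1.2, 0.4, 3.5, 0.0, 2.1, -0.3, 0.2, -0.8, -1.0]
label, prob = predict_label(example_logits)
print(label, round(prob, 3))
```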

Run with Transformers

!pip install -q transformers torch pillow gradio

import gradio as gr
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor
model_name = "prithivMLmods/MetaCLIP-2-Cifar10"
model = AutoModelForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

def cifar10_classification(image):
    """Predicts the CIFAR-10 class represented in an image."""
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    labels = {
        "0": "airplane",
        "1": "automobile",
        "2": "bird",
        "3": "cat",
        "4": "deer",
        "5": "dog",
        "6": "frog",
        "7": "horse",
        "8": "ship",
        "9": "truck"
    }
    predictions = {labels[str(i)]: round(probs[i], 3) for i in range(len(probs))}

    return predictions

# Create Gradio interface
iface = gr.Interface(
    fn=cifar10_classification,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(label="Prediction Scores"),
    title="CIFAR-10 Classification",
    description="Upload an image to classify it into one of the CIFAR-10 categories."
)

# Launch the app
if __name__ == "__main__":
    iface.launch()
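
Outside Gradio, the dictionary returned by `cifar10_classification` can be post-processed directly. A minimal sketch, assuming a dictionary of label-to-score pairs shaped like that return value; `top_k` is a hypothetical helper and the scores shown are made up for illustration:

```python
def top_k(predictions, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up scores in the same shape as the function's return value.
example_scores = {
    "airplane": 0.002, "automobile": 0.001, "bird": 0.010, "cat": 0.870,
    "deer": 0.004, "dog": 0.105, "frog": 0.003, "horse": 0.003,
    "ship": 0.001, "truck": 0.001,
}

for label, score in top_k(example_scores):
    print(f"{label}: {score:.3f}")
```

Within the Gradio app itself, `gr.Label` also accepts a `num_top_classes` argument that limits how many classes are displayed.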

Intended Use:

The MetaCLIP-2-Cifar10 model is designed for object classification across the ten CIFAR-10 categories. Potential use cases include:

Educational & Research Applications: Benchmarking experiments, model comparison, and deep learning studies.
Lightweight Vision Systems: Simple object recognition restricted to the ten CIFAR-10 classes.
Dataset Exploration: Assisting in data inspection, annotation, and visualization.
Prototype Systems: Rapid prototyping of image classification pipelines.