Model Details

The CLIP model was pretrained from openai/clip-vit-base-patch32 , to learn about what contributes to robustness in computer vision tasks.

The model has the ability to generalize to arbitrary image classification tasks in a zero-shot manner.

Top predictions:

       Saree: 64.89%
     Dupatta: 25.81%
     Lehenga: 7.51%

Leggings and Salwar: 0.84% Women Kurta: 0.44%

image/png

Use with Transformers

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("samim2024/clip")
processor = CLIPProcessor.from_pretrained("samim2024/clip")

url = "https://www.istockphoto.com/photo/indian-saris-gm93355119-10451468"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a saree", "a photo of a blouse"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
Downloads last month
19
Safetensors
Model size
0.2B params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support