DatologyAI CLIP Retrieval Optimized ViT-B/32
DatologyAI CLIP Retrieval is a state-of-the-art contrastive vision-language model optimized for image-text retrieval through advanced data curation. This retrieval-optimized ViT-B/32 matches SigLIP2's MSCOCO retrieval performance while training on roughly half as many samples.
Model Description
DatologyAI's retrieval-optimized CLIP model demonstrates superior performance on retrieval benchmarks through targeted data curation strategies:
- State-of-the-art MSCOCO performance for ViT-B/32 models
- 2x training efficiency compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation
 
Intended Uses
This model is optimized for image-text retrieval tasks, cross-modal search, and multimodal understanding applications.
Image-to-Text Retrieval
import torch
from PIL import Image
import open_clip
# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
# Load and process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)
# Define text candidates
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city"
]
text_tokens = tokenizer(texts)
# Compute similarities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T)
    
# Get top matches
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
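The 100.0 factor above stands in for CLIP's learned logit scale. If probabilities over the candidate captions are more convenient than raw similarity scores, a softmax can be applied (a minimal sketch continuing from the example above):
# Convert scaled similarities into a probability distribution over the candidate texts
probs = similarity.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.3f}")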
Text-to-Image Retrieval
import torch
import open_clip
from typing import List
def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5):
    """
    Retrieve top-k images for a text query
    
    Args:
        query: Text description to search for
        image_features: Pre-computed normalized image features [N, 512]
        top_k: Number of images to retrieve
    """
    # Encode text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarities
    similarities = (100.0 * text_features @ image_features.T).squeeze()
    
    # Get top-k matches
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()
# Example usage
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
# Pre-compute image features for your dataset
# image_features = ... # Shape: [num_images, 512]
# Search for images
indices, scores = retrieve_images("a red sports car", image_features)
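The pre-computed image features referenced above can be built with the same model. A minimal sketch, assuming a small hypothetical list of image paths (batch and move to GPU as needed for larger collections):
from PIL import Image

# Hypothetical image paths for illustration
image_paths = ["img_001.jpg", "img_002.jpg"]
batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])

with torch.no_grad():
    image_features = model.encode_image(batch)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Optionally cache the features so the encoding step only runs once
torch.save(image_features, "image_features.pt")

indices, scores = retrieve_images("a red sports car", image_features, top_k=2)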
Training Procedure
DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:
- Text-aligned distribution matching - Prioritizes alignment along text representations for retrieval tasks
- Retrieval-specific synthetic data - Optimized caption generation for cross-modal understanding
- Balanced multimodal representation - Ensures strong performance in both retrieval directions (image-to-text and text-to-image)
 
The model uses standard CLIP contrastive objectives without architectural modifications.
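For reference, the standard CLIP objective is a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings; the sketch below illustrates the idea and is not the actual training code:
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # image_features and text_features are [batch, dim] and L2-normalized
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    # The i-th image and i-th text in the batch form the positive pair
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_img = F.cross_entropy(logits_per_image, labels)
    loss_txt = F.cross_entropy(logits_per_text, labels)
    return (loss_img + loss_txt) / 2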
Training Data
The model was trained on image-text pairs curated from the DataComp-XL dataset using DatologyAI's retrieval-optimized curation pipeline, selecting high-quality pairs that enhance cross-modal alignment.
Evaluation Results
Retrieval Performance
| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP | 
|---|---|---|---|---|
| MSCOCO | Retrieval@1 | 55.53% | 55.45% | 46.6% | 
| Flickr30K | Retrieval@1 | 79.7% | 82.4% | 72.9% | 
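Retrieval@1 (recall at rank 1) measures how often the top-ranked candidate is the correct match. A minimal sketch of the computation, assuming pre-computed normalized features and a single caption per image (benchmarks such as MSCOCO use multiple captions per image, which the full evaluation accounts for):
import torch

def recall_at_1(image_features, text_features):
    # Row i scores image i against every caption; the matching caption is on the diagonal
    sims = image_features @ text_features.T
    targets = torch.arange(sims.shape[0])
    image_to_text = (sims.argmax(dim=1) == targets).float().mean().item()
    text_to_image = (sims.argmax(dim=0) == targets).float().mean().item()
    return image_to_text, text_to_image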
Training Efficiency
- Matches SigLIP2 MSCOCO performance with 50% fewer samples (20B vs 40B)
- Exceeds MetaCLIP by >5% absolute on both benchmarks
 
Model Details
- Developed by: DatologyAI
- Model type: CLIP (Contrastive Language-Image Pre-training)
- Architecture: Vision Transformer B/32
- License: Apache 2.0
- Training framework: OpenCLIP 2.24.0
- Optimization focus: Image-text retrieval
 
Technical Specifications
Model Architecture
- Vision Encoder: ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- Text Encoder: 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
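These numbers can be sanity-checked after loading the model; a quick sketch (`model.visual` is the vision tower in OpenCLIP's CLIP class):
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
    print(model.encode_image(dummy).shape)   # expected: torch.Size([1, 512])

# Approximate parameter count of the vision tower
print(sum(p.numel() for p in model.visual.parameters()))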
 
 
Training Configuration
- Optimizer: AdamW (β1=0.9, β2=0.98, ε=1e-6); see the sketch after this list
- Learning rate: 1e-3 with cosine schedule
- Weight decay: 0.1
- Batch size: 32,768
- Training approach: Retrieval-optimized data curation
- Hardware: Distributed training on H100 GPUs
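The optimizer settings above correspond roughly to the following PyTorch configuration (a sketch only; the real training uses OpenCLIP's own training loop, and the step count here is a placeholder):
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.1,
)

# Cosine learning-rate decay; total_steps is an illustrative placeholder
total_steps = 100_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)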
 
Usage Tips
- Feature Caching: For large-scale retrieval, pre-compute and cache image features
- Batch Processing: Process multiple queries simultaneously for efficiency
- Normalization: Always normalize features before computing similarities
- Temperature Scaling: Adjust the similarity temperature for different use cases (see the sketch below)
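As an example of the batch-processing and temperature-scaling tips, several text queries can be encoded in one pass and the similarity scale adjusted before the softmax (a sketch continuing the retrieval setup above; the temperature value is illustrative):
queries = ["a red sports car", "a bowl of fruit", "a snowy mountain"]
text_tokens = tokenizer(queries)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

temperature = 100.0  # larger values sharpen the ranking, smaller values soften it
similarities = temperature * text_features @ image_features.T  # [num_queries, num_images]
probs = similarities.softmax(dim=-1)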
 
Citation
If you use this model, please cite:
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
Additional Information
For more details on our data curation methodology and comprehensive benchmark results, please visit our blog post.
Contact: [email protected]
Model Card Contact
DatologyAI Team - [email protected]