Reduced SigLIP for Person Visual Descriptions

This model belongs to a family of reduced-dimension variants of google/siglip-base-patch16-224 and google/siglip2-base-patch16-224 fine-tuned for person visual description. It projects the original embeddings into a smaller space using trainable linear projection layers.


Model Details for the Reduced Version

  • Base model: google/siglip-base-patch16-224
  • Reduced dimension: 64
  • Architecture modifications: Added two linear layers (one for text, one for image) that project embeddings from the original dimension down to reduced_dim.
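
As a rough sketch (the class and attribute names below are illustrative, not the model's actual implementation), the modification looks like this:

```python
import torch.nn as nn
from transformers import AutoModel

# Illustrative sketch only: wrap the base SigLIP model and project both
# modalities from the original embedding size down to reduced_dim.
class ReducedSiglipSketch(nn.Module):
    def __init__(self, base_name="google/siglip-base-patch16-224", reduced_dim=64):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_name)
        hidden = self.base.config.text_config.hidden_size  # 768 for the base model
        self.text_proj = nn.Linear(hidden, reduced_dim)
        self.image_proj = nn.Linear(hidden, reduced_dim)

    def get_text_features(self, **inputs):
        return self.text_proj(self.base.get_text_features(**inputs))

    def get_image_features(self, **inputs):
        return self.image_proj(self.base.get_image_features(**inputs))
```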

Intended Uses & Limitations

Example Applications

  • Person retrieval based on textual or visual descriptions of the person
  • Person re-identification using visual descriptions
  • Embedding extraction for retrieval systems

Limitations and Bias

  • May inherit biases from the base SigLIP model and training data
  • Not suitable for tasks requiring detailed fine-grained recognition without further training
  • Trained on surveillance data; suitable for tasks where a substantial portion of the person is visible

Training

Loss function

  • Type: Soft contrastive loss with label smoothing
  • Description: The model is trained to align text and image embeddings using a modified contrastive loss. Instead of hard one-hot targets, label smoothing distributes a small probability mass to all other samples in the batch. Embeddings are normalized before computing similarity, and the loss is computed symmetrically for image-to-text and text-to-image directions.
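
A minimal sketch of such a loss, assuming uniform smoothing over the in-batch negatives and illustrative values for the smoothing factor and temperature:

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(text_embeds, image_embeds, smoothing=0.1, temperature=0.07):
    """Symmetric contrastive loss with label-smoothed targets (sketch, batch size > 1)."""
    # Normalize embeddings so similarities are cosine similarities.
    t = F.normalize(text_embeds, dim=-1)
    v = F.normalize(image_embeds, dim=-1)
    logits = t @ v.t() / temperature  # (batch, batch) similarity matrix

    n = logits.size(0)
    # Soft targets: 1 - smoothing on the diagonal, the rest spread uniformly.
    targets = torch.full((n, n), smoothing / (n - 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)

    # Symmetric cross-entropy: text-to-image and image-to-text directions.
    loss_t2i = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_i2t = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return (loss_t2i + loss_i2t) / 2
```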

Evaluation Metric

  • Metric: Truncated Cumulative Matching Characteristic (CMC) AUC
  • Description: Measures the fraction of queries where the correct match appears within the top K ranks (e.g., top-10). Unlike MRR or strict top-1 accuracy, this metric rewards consistent retrieval of all relevant matches near the top ranks, rather than a few perfect hits with others ranked very low.
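
For illustration, a truncated CMC AUC can be computed from the 1-based ranks of the correct matches as follows (one common formulation; the exact evaluation code may differ):

```python
import numpy as np

def truncated_cmc_auc(ranks, k=20):
    """Average of the CMC curve up to rank k, from 1-based ranks of correct matches."""
    ranks = np.asarray(ranks)
    # cmc[i-1] = fraction of queries whose correct identity appears within rank i
    cmc = np.array([(ranks <= i).mean() for i in range(1, k + 1)])
    return cmc.mean()

# Example: three queries whose correct identities are ranked 1, 4, and 30.
print(truncated_cmc_auc([1, 4, 30], k=10))
```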

Datasets

  • Sources: CUHK, ICFG, IIITD, ITCPR, PRW
  • Processing: Text descriptions were processed with the Mistral LLM to remove ambiguous information about pose or context, leaving only clear visual characteristics in a structured format where key features are separated by commas.

Training Setup

  • Initially, the pretrained base model was frozen and only the projection layers were trained.
  • Fine-tuning: the model head was then unfrozen and trained, followed by the full model.
  • Optimizer: AdamW
  • Learning rate: 1e-4 for warm-up, 5e-6 for the rest
  • Epochs: 4 epochs for projection layer warm-up, 4 epochs for model head fine-tuning, and 20 epochs for full model fine-tuning
  • Visual Augmentations: random operations including small rotations, hue variations, horizontal flip, and color jitter
  • Text Augmentations: random subsets of key features are removed from the comma-separated description strings to create augmented training samples (see the sketch after this list)
  • Training code: available on [GitHub](https://github.com/MarketaJu/ReducedSiglipForPersonDescription)
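
A minimal sketch of the text augmentation, with an illustrative drop probability:

```python
import random

def augment_description(description, drop_prob=0.3, rng=random):
    """Drop a random subset of comma-separated key features (illustrative values)."""
    features = [f.strip() for f in description.split(",") if f.strip()]
    kept = [f for f in features if rng.random() > drop_prob]
    # Always keep at least one feature so the description is never empty.
    if not kept:
        kept = [rng.choice(features)]
    return ", ".join(kept)

# Example
print(augment_description("red jacket, blue jeans, short hair, white sneakers"))
```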

Results

We evaluated the model on a custom test split drawn from the datasets listed above. The following table summarizes the number of identities, images, and queries for each subset; the final test set is the union of all subsets.

| Dataset | #Identities | #Images | #Queries |
|---|---|---|---|
| CUHK | 791 | 2263 | 4526 |
| ICFG | 602 | 6714 | 6714 |
| IIITD | 981 | 981 | 1962 |
| ITCPR | 999 | 1607 | 1607 |
| PRW | 293 | 3459 | 3855 |
| Final (All) | 3666 | 15024 | 18664 |

Evaluation of the Model per Dataset

Since our task is focused on retrieving the correct person identity rather than the exact matching image, the evaluation is performed as follows:

  • For each text query, the goal is to retrieve the correct identity, not the exact corresponding image.
  • During evaluation, scores are computed over all images belonging to the same identity, and the maximum score is taken to represent that identity.
  • These identity-level scores are then ranked, and the following retrieval metrics are calculated:
    • Top-k: standard top-1, top-5, top-10 accuracy
    • MRR: Mean Reciprocal Rank
    • CMC AUC: Truncated CMC AUC evaluated up to rank 20

| Dataset | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| CUHK | 65.0 | 86.4 | 92.0 | 73.3 | 90.8 |
| ICFG | 59.5 | 83.3 | 89.6 | 68.0 | 88.4 |
| IIITD | 84.7 | 96.8 | 98.4 | 90.1 | 97.9 |
| ITCPR | 44.4 | 69.8 | 79.9 | 54.5 | 78.7 |
| PRW | 76.5 | 94.4 | 97.5 | 82.1 | 96.5 |
| Final (All) | 63.7 | 85.1 | 90.6 | 71.6 | 89.6 |
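
A minimal sketch of this identity-level evaluation, assuming precomputed embeddings and per-image identity labels (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def identity_ranks(text_embeds, image_embeds, image_identities, query_identities):
    """Return the 1-based rank of the correct identity for each text query."""
    # Cosine similarities between every query and every gallery image.
    sims = F.normalize(text_embeds, dim=-1) @ F.normalize(image_embeds, dim=-1).t()

    identities = sorted(set(image_identities))
    # Identity score = maximum similarity over all images of that identity.
    id_scores = torch.stack(
        [sims[:, [i for i, pid in enumerate(image_identities) if pid == ident]].max(dim=1).values
         for ident in identities],
        dim=1,
    )

    ranks = []
    for q, true_id in enumerate(query_identities):
        order = id_scores[q].argsort(descending=True)
        col = identities.index(true_id)
        ranks.append((order == col).nonzero().item() + 1)
    return ranks
```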

Note: the individual subsets contain far fewer identities (≈600) than the merged test set (≈4000). This difference in gallery size explains the variation in accuracy between the individual subsets and the final combined results.

Cross-Model Comparison

Model based on google/siglip-base-patch16-224:

| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip-base-patch16-224 | 26.3 | 47.8 | 57.4 | 35.5 | 56.7 |
| finetuned_siglip | 69.1 | 87.5 | 92.2 | 75.9 | 91.4 |
| siglip-person-description-128 | 64.8 | 85.4 | 90.6 | 72.5 | 89.8 |
| siglip-person-description-64 | 63.7 | 85.1 | 90.6 | 71.6 | 89.6 |
| siglip-person-description-32 | 57.6 | 82.1 | 88.9 | 66.6 | 87.6 |

Model based on google/siglip2-base-patch16-224:

| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip2-base-patch16-224 | 20.4 | 39.5 | 48.9 | 28.5 | 48.6 |
| finetuned_siglip2 | 67.9 | 87.4 | 92.2 | 75.0 | 91.3 |
| siglip2-person-description-128 | 64.1 | 85.4 | 90.0 | 72.1 | 89.9 |
| siglip2-person-description-64 | 62.9 | 84.9 | 90.4 | 71.1 | 89.6 |
| siglip2-person-description-32 | 56.9 | 81.9 | 88.7 | 66.2 | 87.4 |

Usage

The usage is identical to SigLIP.


```python
# Import custom model code from the repository
from modeling_resipvd import ReSiPVDModel

from skimage.io import imread
from transformers import AutoModel, AutoProcessor

# Load the processor from the base model and the reduced model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("MarketaJu/siglip-person-description-64")

# Example: get text and image embeddings
image = imread("test.jpg")
text_inputs = processor(text=["random person description"], return_tensors="pt",
                        padding="max_length", truncation=True)
image_inputs = processor(images=image, return_tensors="pt")
text_embeds = model.get_text_features(**text_inputs)
image_embeds = model.get_image_features(**image_inputs)
```
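
To rank gallery images for a text query, the embeddings can be compared with cosine similarity, for example:

```python
import torch.nn.functional as F

# Cosine similarity between each text query and each image embedding
scores = F.normalize(text_embeds, dim=-1) @ F.normalize(image_embeds, dim=-1).t()
print(scores)
```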


---

## Citation

If you use this model, please cite:

```bibtex
@misc{reduced-siglip-visualdescription,
  title={Reduced SigLIP for Visual Descriptions},
  author={Marketa Jurankova},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MarketaJu/reduced-siglip}}
}
```