# Reduced SigLIP for Person Visual Descriptions
This model belongs to a family of reduced-dimension variants of google/siglip-base-patch16-224 and google/siglip2-base-patch16-224 fine-tuned for person visual description. It projects the original embeddings into a smaller space using trainable linear projection layers.
## Model Details for the Reduced Version
- Base model: google/siglip-base-patch16-224
- Reduced dimension: 64
- Architecture modifications: added two linear layers (one for text, one for image) that project embeddings from the original dimension down to `reduced_dim`.
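A minimal sketch of this modification, assuming the Hugging Face `transformers` API; the class and attribute names below are illustrative and not the actual `ReSiPVDModel` implementation from the repository:

```python
import torch.nn as nn
from transformers import AutoModel

class ReducedSigLIPSketch(nn.Module):
    """Pretrained SigLIP backbone plus two trainable linear projections
    that map text and image embeddings down to reduced_dim."""
    def __init__(self, base_name="google/siglip-base-patch16-224", reduced_dim=64):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_name)
        hidden = self.base.config.text_config.hidden_size  # 768 for the base model
        self.text_proj = nn.Linear(hidden, reduced_dim)
        self.image_proj = nn.Linear(hidden, reduced_dim)

    def get_text_features(self, **text_inputs):
        return self.text_proj(self.base.get_text_features(**text_inputs))

    def get_image_features(self, **image_inputs):
        return self.image_proj(self.base.get_image_features(**image_inputs))
```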
## Intended Uses & Limitations
### Example Applications
- Person retrieval based on textual or visual descriptions of the person
- Person re-identification using visual descriptions
- Embedding extraction for retrieval systems
### Limitations and Bias
- May inherit biases from the base SigLIP model and training data
- Not suitable for tasks requiring detailed fine-grained recognition without further training
- Trained on surveillance data; suitable for tasks where a substantial portion of the person is visible
## Training
### Loss Function
- Type: Soft contrastive loss with label smoothing
- Description: The model is trained to align text and image embeddings using a modified contrastive loss. Instead of hard one-hot targets, label smoothing distributes a small probability mass to all other samples in the batch. Embeddings are normalized before computing similarity, and the loss is computed symmetrically for image-to-text and text-to-image directions.
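A minimal sketch of this loss, assuming CLIP-style batch-wise contrastive training over paired embeddings; the temperature and smoothing values are illustrative assumptions, not the training configuration:

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_embeds, text_embeds, temperature=0.07, smoothing=0.1):
    # Normalize embeddings before computing similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarities, scaled by temperature
    logits = image_embeds @ text_embeds.t() / temperature

    # Soft targets: label smoothing spreads a small probability mass over all
    # other samples in the batch instead of using hard one-hot targets
    n = logits.size(0)
    targets = torch.full((n, n), smoothing / max(n - 1, 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)

    # Symmetric loss over image-to-text and text-to-image directions
    loss_i2t = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = (-targets * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return (loss_i2t + loss_t2i) / 2
```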
### Evaluation Metric
- Metric: Truncated Cumulative Matching Characteristic (CMC) AUC
- Description: Measures the fraction of queries where the correct match appears within the top K ranks (e.g., top-10). Unlike MRR or strict top-1 accuracy, this metric rewards consistent retrieval of all relevant matches near the top ranks, rather than a few perfect hits with others ranked very low.
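A minimal sketch of the metric, assuming the 1-based rank of each query's correct match is known; the exact normalization is an assumption:

```python
import numpy as np

def truncated_cmc_auc(ranks, max_rank=20):
    """Mean of the CMC curve over k = 1..max_rank, where the CMC value at k is
    the fraction of queries whose correct match appears within the top-k ranks."""
    ranks = np.asarray(ranks)
    cmc = np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    return cmc.mean()
```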
### Datasets
- Sources: CUHK, ICFG, IIITD, ITCPR, PRW
- Processing: Text descriptions were processed with the Mistral LLM to remove ambiguous information about pose or context, leaving only clear visual characteristics in a structured format in which key features are separated by commas.
### Training Setup
- The pretrained base model was initially frozen; only the projection layers were trained.
- Fine-tuning: the model head was then trained, followed by the whole model.
- Optimizer: AdamW
- Learning rate: 1e-4 for warm-up, 5e-6 for the rest
- Epochs: 4 epochs for projection layer warm-up, 4 epochs for model head fine-tuning, and 20 epochs for full model fine-tuning
- Visual augmentations: random operations including small rotations, hue variations, horizontal flips, and color jitter
- Text augmentations: random subsets of key features are removed from the comma-separated description strings to create augmented training samples (see the sketch after this list)
- Training code: available on [GitHub](https://github.com/MarketaJu/ReducedSiglipForPersonDescription)
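A minimal sketch of both augmentation pipelines, assuming PyTorch/torchvision; the parameter values (rotation range, jitter strengths, drop probability) and function names are illustrative assumptions, not the released training configuration:

```python
import random
from torchvision import transforms

# Visual augmentations: small rotations, hue variation, horizontal flip, color jitter
# (parameter values are illustrative assumptions)
visual_augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])

def augment_description(description: str, drop_prob: float = 0.3) -> str:
    """Text augmentation: randomly drop a subset of comma-separated key features."""
    features = [f.strip() for f in description.split(",") if f.strip()]
    if not features:
        return description
    kept = [f for f in features if random.random() > drop_prob]
    if not kept:  # always keep at least one feature
        kept = [random.choice(features)]
    return ", ".join(kept)
```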
## Results
We evaluated the model on a custom test split from the datasets mentioned above. The following table summarizes the number of identities, images, and queries for each subset.
The final test set is a merge of all subsets.
| Dataset | #Identities | #Images | #Queries |
|---|---|---|---|
| CUHK | 791 | 2263 | 4526 |
| ICFG | 602 | 6714 | 6714 |
| IIITD | 981 | 981 | 1962 |
| ITCPR | 999 | 1607 | 1607 |
| PRW | 293 | 3459 | 3855 |
| Final (All) | 3666 | 15024 | 18664 |
### Evaluation of the Model per Dataset
Since our task is focused on retrieving the correct person identity rather than the exact matching image, the evaluation is performed as follows:
- For each text query, the goal is to retrieve the correct identity, not the exact corresponding image.
- During evaluation, scores are computed over all images belonging to the same identity, and the maximum score is taken to represent that identity.
- These identity-level scores are then ranked, and the following retrieval metrics are calculated:
  - Top-k: standard top-1, top-5, top-10 accuracy
  - MRR: Mean Reciprocal Rank
  - CMC AUC: Truncated CMC AUC evaluated up to rank 20
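A minimal sketch of this identity-level evaluation, assuming a precomputed query-by-image similarity matrix; function and variable names are illustrative, not the released evaluation code:

```python
import numpy as np

def identity_level_metrics(scores, image_ids, query_ids, max_rank=20):
    """scores: (num_queries, num_images) text-image similarities,
    image_ids: identity label of each gallery image,
    query_ids: correct identity for each text query."""
    image_ids = np.asarray(image_ids)
    identities = np.unique(image_ids)

    # Max-pool scores over all images belonging to the same identity
    id_scores = np.stack(
        [scores[:, image_ids == pid].max(axis=1) for pid in identities], axis=1)

    # Rank identities per query and find the 1-based rank of the correct one
    ranked_ids = identities[np.argsort(-id_scores, axis=1)]
    ranks = np.array([np.where(ranked_ids[q] == query_ids[q])[0][0] + 1
                      for q in range(len(query_ids))])

    top_k = {k: (ranks <= k).mean() for k in (1, 5, 10)}
    mrr = (1.0 / ranks).mean()
    cmc_auc = np.mean([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    return top_k, mrr, cmc_auc
```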
| Dataset | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| CUHK | 65.0 | 86.4 | 92.0 | 73.3 | 90.8 |
| ICFG | 59.5 | 83.3 | 89.6 | 68.0 | 88.4 |
| IIITD | 84.7 | 96.8 | 98.4 | 90.1 | 97.9 |
| ITCPR | 44.4 | 69.8 | 79.9 | 54.5 | 78.7 |
| PRW | 76.5 | 94.4 | 97.5 | 82.1 | 96.5 |
| Final (All) | 63.7 | 85.1 | 90.6 | 71.6 | 89.6 |
Note: The individual subsets contain fewer identities (≈600) than the merged test set (≈4000). This difference in scale explains the variation in accuracy between individual subsets and the final combined results.
### Cross-Model Comparison
Models based on google/siglip-base-patch16-224:
| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip-base-patch16-224 | 26.3 | 47.8 | 57.4 | 35.5 | 56.7 |
| finetuned_siglip | 69.1 | 87.5 | 92.2 | 75.9 | 91.4 |
| siglip-person-description-128 | 64.8 | 85.4 | 90.6 | 72.5 | 89.8 |
| siglip-person-description-64 | 63.7 | 85.1 | 90.6 | 71.6 | 89.6 |
| siglip-person-description-32 | 57.6 | 82.1 | 88.9 | 66.6 | 87.6 |
Models based on google/siglip2-base-patch16-224:
| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip2-base-patch16-224 | 20.4 | 39.5 | 48.9 | 28.5 | 48.6 |
| finetuned_siglip2 | 67.9 | 87.4 | 92.2 | 75.0 | 91.3 |
| siglip2-person-description-128 | 64.1 | 85.4 | 90.0 | 72.1 | 89.9 |
| siglip2-person-description-64 | 62.9 | 84.9 | 90.4 | 71.1 | 89.6 |
| siglip2-person-description-32 | 56.9 | 81.9 | 88.7 | 66.2 | 87.4 |
## Usage
The usage is identical to SigLIP.
```python
# Import the custom model code from the repository (defines the ReSiPVD architecture)
from modeling_resipvd import ReSiPVDModel

from transformers import AutoModel, AutoProcessor
from skimage.io import imread

# Load the processor and the model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("MarketaJu/siglip-person-description-64")

# Example: get text and image embeddings
image = imread("test.jpg")
text_inputs = processor(text=["random person description"], return_tensors="pt", padding="max_length", truncation=True)
image_inputs = processor(images=image, return_tensors="pt")
text_embeds = model.get_text_features(**text_inputs)
image_embeds = model.get_image_features(**image_inputs)
```
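The reduced embeddings can then be compared directly; the cosine-similarity follow-up below is an illustrative sketch, not part of the original example:

```python
import torch.nn.functional as F

# Normalize and compare the reduced embeddings; higher scores indicate better matches
text_embeds = F.normalize(text_embeds, dim=-1)
image_embeds = F.normalize(image_embeds, dim=-1)
similarity = text_embeds @ image_embeds.t()
print(similarity)
```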
---
## Citation
If you use this model, please cite:
```bibtex
@misc{reduced-siglip-visualdescription,
  title={Reduced SigLIP for Visual Descriptions},
  author={Marketa Jurankova},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MarketaJu/reduced-siglip}}
}
```