---
license: apache-2.0
tags:
- image-feature-extraction
- image-text-retrieval
- multimodal
- siglip
- person-search
datasets:
- custom
language:
- en
pipeline_tag: image-feature-extraction
---
|
|
|
|
|
# SigLIP Person Search - Open Set
|
|
|
|
|
This model is a fine-tuned version of **`google/siglip-base-patch16-224`** for open-set **person retrieval** from **natural language descriptions**. It is built for **image-text similarity** search in real-world retail and surveillance scenarios.
|
|
|
|
|
## Use Case
|
|
|
|
|
This model allows you to search for people in crowded environments (like malls or stores) using only a **text prompt**, for example: |
|
|
|
|
|
> "A man wearing a white t-shirt and carrying a brown shoulder bag" |
|
|
|
|
|
The model will return person crops that match the description. |
|
|
|
|
|
## Training
|
|
|
|
|
* Base: `google/siglip-base-patch16-224` |
|
|
* Loss: Cosine InfoNCE |
|
|
* Data: ReID dataset with multimodal attributes (generated via Gemini) |
|
|
* Epochs: 10 |
|
|
* Usage: Retrieval-style search (not classification) |
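The exact loss implementation is not published in this card; a minimal sketch of a symmetric cosine InfoNCE objective (the temperature value and embedding size here are illustrative assumptions, not the training configuration) might look like:

```python
import torch
import torch.nn.functional as F

def cosine_infonce(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so dot products become cosine similarities
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # matched pairs lie on the diagonal
    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Random stand-ins for a batch of paired image/text embeddings
img_emb = torch.randn(8, 768)
txt_emb = torch.randn(8, 768)
loss = cosine_infonce(img_emb, txt_emb)
```

Each image embedding is contrasted against every caption in the batch, pulling matched pairs together and pushing mismatched pairs apart.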
|
|
|
|
|
## Intended Use
|
|
|
|
|
* Smart surveillance |
|
|
* Anonymous retail behavior tracking |
|
|
* Human-in-the-loop retrieval |
|
|
* Visual search & retrieval systems |
|
|
|
|
|
## How to Use
|
|
|
|
|
```python
from transformers import AutoProcessor, AutoModel
import torch

processor = AutoProcessor.from_pretrained("adonaivera/siglip-person-search-openset")
model = AutoModel.from_pretrained("adonaivera/siglip-person-search-openset")

text = "A man wearing a white t-shirt and carrying a brown shoulder bag"
# SigLIP was trained with max_length padding, so pad text inputs the same way
inputs = processor(text=text, padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)
```
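To turn that text embedding into a retrieval result, embed each detected person crop with `model.get_image_features` and rank crops by cosine similarity. A sketch of the ranking step, using random tensors as stand-ins for the model outputs (the 768 dimension matches `siglip-base`):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model.get_image_features(...) over a gallery of 100 person
# crops and model.get_text_features(...) for one query description.
gallery_features = torch.randn(100, 768)
text_features = torch.randn(1, 768)

# Normalize so the dot product is cosine similarity
gallery = F.normalize(gallery_features, dim=-1)
query = F.normalize(text_features, dim=-1)

scores = (gallery @ query.t()).squeeze(-1)  # one similarity score per crop
top_scores, top_idx = scores.topk(5)        # indices of the best-matching crops
```

`top_idx` then indexes back into your list of crops to display the matches.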
|
|
|
|
|
## Notes
|
|
|
|
|
* This model is optimized for **feature extraction** and **cosine similarity matching** |
|
|
* It's not meant for classification or image generation |
|
|
* Similarity threshold tuning is required depending on your application |
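In an open-set setting, a query may match zero, one, or many crops, so a fixed top-k is often replaced by an acceptance threshold. A minimal example of threshold filtering (the scores and the 0.30 cutoff are purely illustrative; tune the threshold on validation data for your deployment):

```python
import torch

# Illustrative cosine similarities for four candidate crops
scores = torch.tensor([0.41, 0.12, 0.33, 0.27])
threshold = 0.30  # application-specific; tune on held-out data

# Keep only the crops whose similarity clears the threshold
matches = (scores >= threshold).nonzero(as_tuple=True)[0]
# matches -> indices 0 and 2
```

Raising the threshold trades recall for precision; surveillance-style alerting usually favors a higher cutoff than exploratory search.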
|
|
|