---
license: apache-2.0
tags:
  - image-feature-extraction
  - image-text-retrieval
  - multimodal
  - siglip
  - person-search
datasets:
  - custom
language:
  - en
pipeline_tag: image-feature-extraction
---

# πŸ” SigLIP Person Search - Open Set

This model is a fine-tuned version of **`google/siglip-base-patch16-224`** for open-set **person retrieval** based on **natural language descriptions**. It's built to support **image-text similarity** in real-world retail and surveillance scenarios.

## 🧠 Use Case

This model allows you to search for people in crowded environments (like malls or stores) using only a **text prompt**, for example:

> "A man wearing a white t-shirt and carrying a brown shoulder bag"

The model will return person crops that match the description.

## πŸ’Ύ Training

* Base: `google/siglip-base-patch16-224`
* Loss: Cosine InfoNCE
* Data: ReID dataset with multimodal attributes (generated via Gemini)
* Epochs: 10
* Usage: Retrieval-style search (not classification)
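
The cosine InfoNCE objective listed above can be sketched as follows. This is a minimal NumPy illustration of the loss shape, not the actual training code; the temperature value and function name are illustrative assumptions.

```python
import numpy as np

def cosine_infonce(text_emb, image_emb, temperature=0.07):
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (batch, batch) similarity matrix
    # Matched text/image pairs sit on the diagonal and act as positives;
    # all other pairs in the batch are negatives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls each description toward its paired person crop while pushing it away from the other crops in the batch.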

## πŸ“ˆ Intended Use

* Smart surveillance
* Anonymous retail behavior tracking
* Human-in-the-loop retrieval
* Visual search & retrieval systems

## πŸ”§ How to Use

```python
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("adonaivera/siglip-person-search-openset")
model = AutoModel.from_pretrained("adonaivera/siglip-person-search-openset")

# SigLIP was trained with fixed-length text inputs, so pass padding="max_length"
text = "A man wearing a white t-shirt and carrying a brown shoulder bag"
inputs = processor(text=text, padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)

# Encode a person crop the same way (path is an example)
image = Image.open("person_crop.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
```

## πŸ“Œ Notes

* This model is optimized for **feature extraction** and **cosine similarity matching**
* It's not meant for classification or image generation
* Similarity threshold tuning is required depending on your application
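
A minimal sketch of the cosine-similarity matching and thresholding described above, using made-up embedding values (real features come from `get_text_features` / `get_image_features`; the 0.8 threshold is only a placeholder to be tuned):

```python
import numpy as np

def cosine_scores(text_feat, image_feats):
    # L2-normalize so the dot product equals cosine similarity
    t = text_feat / np.linalg.norm(text_feat)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return v @ t

# Hypothetical query embedding vs. three person-crop embeddings
query = np.array([1.0, 0.0, 0.0])
crops = np.array([[0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.7, 0.7, 0.0]])
scores = cosine_scores(query, crops)
threshold = 0.8  # tune per application
matches = np.where(scores >= threshold)[0]
```

Crops scoring above the threshold are returned as matches; raising the threshold trades recall for precision.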