Improve model card: Update GitHub link, add abstract, library_name, tags, and sample usage

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +65 -5
README.md CHANGED
@@ -1,9 +1,13 @@
  ---
+ license: apache-2.0
+ pipeline_tag: image-segmentation
+ library_name: Pytorch
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
- license: apache-2.0
- pipeline_tag: image-segmentation
+ - DINOv2
+ - CLIP
+ - open-vocabulary segmentation
  ---

  <div align="center">
@@ -22,7 +26,7 @@ Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Ope
  <a href="https://www.linkedin.com/in/rita-cucchiara-a4653a13/">Rita Cucchiara</a>
  </h3>

- [Project Page](https://lorebianchi98.github.io/Talk2DINO/) | [Paper](http://arxiv.org/abs/2411.19331) | [Code](https://github.com/your-repo/Talk2DINO)
+ [Project Page](https://lorebianchi98.github.io/Talk2DINO/) | [Paper](http://arxiv.org/abs/2411.19331) | [Code](https://github.com/lorebianchi98/Talk2DINO)

  </div>

@@ -32,6 +36,63 @@ Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Ope
  </figure>
  </div>

+ ## About
+ Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
+
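To make the "learned mapping function" above concrete, here is a minimal, hypothetical sketch of such a projection: a small MLP from the CLIP text embedding space into the DINOv2 patch space, trained while both backbones stay frozen. The class name, layer sizes, and dimensions below are illustrative assumptions, not the repository's actual `ProjectionLayer`.

```python
# HYPOTHETICAL sketch of the learned text-to-DINOv2 mapping.
# The real implementation is the repository's ProjectionLayer (src/model.py);
# layer sizes and dimensions here are assumptions for ViT-B backbones
# (CLIP ViT-B/16 text: 512-d, DINOv2 ViT-B/14 patches: 768-d).
import torch
import torch.nn as nn

class TextToPatchProjection(nn.Module):
    """Maps CLIP text embeddings into the DINOv2 patch embedding space."""

    def __init__(self, clip_dim: int = 512, dino_dim: int = 768, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dino_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

# Both backbones stay frozen; only this projection would be trained,
# e.g. with an InfoNCE-style loss against attention-weighted DINOv2 patch features.
proj = TextToPatchProjection()
dummy_text = torch.randn(4, 512)   # a batch of CLIP text embeddings
projected = proj(dummy_text)       # shape (4, 768), comparable to DINOv2 patch tokens
```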
+ ## Sample Usage
+
+ ### Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO
+ We can use Talk2DINO to map CLIP text embeddings into the DINOv2 patch embedding space.
+ ```python
+ import clip
+ from src.model import ProjectionLayer
+ import torch
+ import os
+
+ # Device setup
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ # Configuration and weights
+ proj_name = 'vitb_mlp_infonce'
+ config_path = os.path.join("configs", f"{proj_name}.yaml")
+ weights_path = os.path.join("weights", f"{proj_name}.pth")
+
+ # Load Talk2DINO projection layer
+ talk2dino = ProjectionLayer.from_config(config_path)
+ talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
+ talk2dino.to(device)
+
+ # Load CLIP model
+ clip_model, clip_preprocess = clip.load("ViT-B/16", device=device, jit=False)
+ tokenizer = clip.tokenize
+
+ # Example: Tokenize and project text features
+ texts = ["a cat"]
+ text_tokens = tokenizer(texts).to(device)
+ text_features = clip_model.encode_text(text_tokens)
+ projected_text_features = talk2dino.project_clip_txt(text_features)
+ ```
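The snippet above stops at the projected text features. As a rough, hypothetical continuation (this is not the repository's inference code; the DINOv2 hub model, the `forward_features` output key, and the random stand-in image are assumptions of the sketch), those features could be scored against DINOv2 patch tokens with cosine similarity to obtain a coarse per-patch category map:

```python
# HYPOTHETICAL continuation of the snippet above: compare DINOv2 patch tokens
# with the projected text features to get a coarse segmentation map.
import torch
import torch.nn.functional as F

# Assumption: a DINOv2 ViT-B/14 backbone loaded from torch.hub.
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').to(device).eval()

image = torch.randn(1, 3, 224, 224, device=device)  # stand-in for a preprocessed RGB image

with torch.no_grad():
    feats = dinov2.forward_features(image)
    patch_tokens = feats["x_norm_patchtokens"]                    # (1, 256, 768) for 224x224, patch size 14

    # Cosine similarity between every patch and every projected text embedding
    patches = F.normalize(patch_tokens.float(), dim=-1)
    text = F.normalize(projected_text_features.float(), dim=-1)   # (num_texts, 768)
    sims = torch.einsum('bnd,td->bnt', patches, text)             # (1, num_patches, num_texts)

    # Reshape to a low-resolution grid and pick the best-matching category per patch
    side = int(sims.shape[1] ** 0.5)                              # 16 patches per side
    seg = sims.argmax(-1).reshape(1, side, side)                  # coarse segmentation map
```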
+
+ ### Demo
+ In `demo.py` we provide a simple example of how to use Talk2DINO for inference on a given image with custom textual categories. Run
+
+ ```bash
+ python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..
+ ```
+
+ Example:
+ ```bash
+ python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route
+ ```
+
+ Result:
+ <div align="center">
+ <table><tr><td><figure>
+ <img alt="" src="./assets/pikachu.png" width=300>
+ </figure></td><td><figure>
+ <img alt="" src="./pikachu_seg.png" width=300>
+ </figure></td></tr></table>
+ </div>

  ## Installation
  ```bash
@@ -87,5 +148,4 @@ If you found this code useful, please cite the following paper:
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.19331},
  }
- ```
-
+ ```