nielsr (HF Staff) committed
Commit a8c7038 · verified · 1 Parent(s): 42d9b0e

Improve model card: Add pipeline tag, library, abstract, and comprehensive usage


This PR significantly enhances the model card for UniLIP-3B by:

- Adding the `pipeline_tag: any-to-any`, accurately reflecting the model's unified multimodal capabilities in understanding, generation, and editing.
- Specifying `library_name: transformers`, since the model is compatible with the 🤗 Transformers library (its `config.json` records a `transformers_version`). This enables automated code snippets for easy usage on the Hub; a hedged loading sketch is included after this list.
- Updating the paper link to point directly to the Hugging Face papers page: [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278).
- Including the full paper abstract, providing essential context about the model's design and contributions.
- Adding detailed "Main Results" tables to showcase the model's performance across various tasks.
- Integrating a comprehensive "Quick Start" section, including installation instructions and "Simple Inference" code snippets for image generation, understanding, and omni-potent tasks, directly from the official GitHub repository. The `model_path` in the code has been updated to refer to `kanashi6/UniLIP-3B` for direct usability.
- Including "Acknowledgements" and "Citation" sections for proper attribution.

These changes will make the model more discoverable, informative, and user-friendly on the Hugging Face Hub.
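
As a hedged illustration of the `library_name: transformers` point above, the sketch below shows how the checkpoint could be loaded through the 🤗 Transformers auto classes. It is a sketch under assumptions, not the documented workflow: whether the repository registers its custom modeling code via `auto_map` (and thus needs `trust_remote_code=True`) is an assumption, and the concrete model class comes from that remote code, not from this snippet. The `FlexARInferenceSolver` examples in the Quick Start section of the diff below remain the official entry point.

```python
# Minimal loading sketch (assumptions: the repo ships custom modeling code and
# registers it via auto_map, hence trust_remote_code=True; the exact model class
# is defined by that remote code, not by this snippet).
import torch
from transformers import AutoConfig, AutoModel

model_id = "kanashi6/UniLIP-3B"

# Inspect which config class the remote code provides before loading weights.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
print(type(config).__name__)

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16, matching the precision used in the Quick Start snippets
    trust_remote_code=True,
)
model.eval()
```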

Files changed (1):
1. README.md (+182, -6)
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- license: apache-2.0
+ base_model:
+ - OpenGVLab/InternVL3-2B
  datasets:
  - BLIP3o/BLIP3o-Pretrain-Long-Caption
  - BLIP3o/BLIP3o-Pretrain-Short-Caption
@@ -7,15 +8,190 @@ datasets:
  - UCSC-VLAA/GPT-Image-Edit-1.5M
  - BLIP3o/BLIP3o-60k
  - FreedomIntelligence/ShareGPT-4o-Image
- base_model:
- - OpenGVLab/InternVL3-2B
+ license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
  ---
- This repository contains the model (3B version) presented in the paper UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.
+
+ # UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
+
+ This repository contains the model (3B version) presented in the paper [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278).

  UniLIP proposes a unified, CLIP-based encoder featuring both rich semantics and fine-grained image details. Through a **two-stage and self-distillation training** for reconstruction, we empower CLIP to achieve excellent reconstruction results **without compromising its original understanding abilities**. Leveraging this powerful unified representation, UniLIP excels across understanding, generation, and editing tasks.

  For more details, please refer to the original paper and the GitHub repository:

- Paper: https://www.arxiv.org/abs/2507.23278
+ Paper: [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278)
+
+ GitHub: https://github.com/nnnth/UniLIP
+
+ <div align="center">
+ <img src="https://github.com/nnnth/UniLIP/raw/main/assets/img1.png" width="800"/>
+ </div>
+
+ ## Abstract
+
+ In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks the reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.
+
+ ## Main Results
+
+ ### Image Reconstruction on ImageNet val
+
+ | Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
+ | :--- | :--- | :--- | :--- | :--- | :--- |
+ | VILA-U | 256 | 16 | 1.80 | - | - |
+ | Tokenflow | 256 | 16 | 1.37 | 21.41 | 0.687 |
+ | DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
+ | **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
+ | Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
+ | **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |
+
+ ### Image Understanding
+
+ | Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
+ | InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
+ | BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
+ | BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
+ | TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
+ | Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
+ | **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
+ | **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |
+
+ ### Image Generation and Editing
+
+ | Model | # Params | GenEval | WISE | ImgEdit |
+ | :--- | :--- | :--- | :--- | :--- |
+ | BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
+ | BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
+ | UniWorld-V1 | 7B+12B | - | - | 3.26 |
+ | **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
+ | **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |
+
+ ## 🛠️ Quick Start
+
+ ### Installation
+
+ ```shell
+ conda create -n UniLIP python=3.11
+ conda activate UniLIP
+ pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --index-url https://download.pytorch.org/whl/cu118
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ ### Simple Inference
+
+ The simplest code for UniLiP inference, covering generation, understanding, and omni-potent tasks:
+
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ # ******************** Image Generation ********************
+ print("--- Image Generation Example ---")
+ inference_solver = FlexARInferenceSolver(
+     model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
+     precision="bf16",
+     target_size=768,
+ )
+
+ q1 = "Generate an image of 768x768 according to the following prompt:\n" \
+      "Image of a dog playing in water, and a waterfall is in the background."
+
+ # generated: tuple of (generated response, list of generated images)
+ generated = inference_solver.generate(
+     images=[],
+     qas=[[q1, None]],
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1, new_image = generated[0], generated[1][0]
+ print(f"Generated text: {a1}")
+ # new_image.save("generated_dog_image.png")  # Uncomment to save the generated image
+
+
+ # ******************* Image Understanding ******************
+ print("\n--- Image Understanding Example ---")
+ inference_solver_understanding = FlexARInferenceSolver(
+     model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
+     precision="bf16",
+     target_size=512,
+ )
+
+ # The "<|image|>" symbol is replaced with a sequence of image tokens before being fed to the LLM
+ q2 = "Describe the image in detail. <|image|>"
+
+ # For a real image, use e.g. images = [Image.open("path/to/your/image.png")]
+ images_for_understanding = [Image.new('RGB', (512, 512), color='white')]  # Placeholder if no file is available
+ qas_for_understanding = [[q2, None]]
+
+ # `len(images)` should equal the number of occurrences of "<|image|>" in qas
+ generated_understanding = inference_solver_understanding.generate(
+     images=images_for_understanding,
+     qas=qas_for_understanding,
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver_understanding.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a2 = generated_understanding[0]
+ print(f"Generated description: {a2}")
+ # generated_understanding[1], namely the list of newly generated images, should typically be empty in this case.
+
+
+ # ********************* Omni-Potent Example (Depth Estimation) *********************
+ print("\n--- Omni-Potent Example (Depth Estimation) ---")
+ inference_solver_omni = FlexARInferenceSolver(
+     model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
+     precision="bf16",
+     target_size=768,
+ )
+
+ # Example: Depth Estimation
+ # For more instructions, see demos/demo_image2image.py
+ q3 = "Depth estimation. <|image|>"
+ # For a real image, use e.g. images = [Image.open("path/to/your/image.png")]
+ images_for_omni = [Image.new('RGB', (768, 768), color='gray')]  # Placeholder if no file is available
+ qas_for_omni = [[q3, None]]
+
+ generated_omni = inference_solver_omni.generate(
+     images=images_for_omni,
+     qas=qas_for_omni,
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver_omni.create_logits_processor(cfg=1.0, image_top_k=200),
+ )
+
+ a3 = generated_omni[0]
+ new_image_omni = generated_omni[1][0]
+ print(f"Generated text (depth estimation): {a3}")
+ # new_image_omni.save("generated_depth_map.png")  # Uncomment to save the generated depth map
+ ```
+
+ ## 👍 Acknowledgement
+ * [TiTok](https://github.com/bytedance/1d-tokenizer): We implement reconstruction training following TiTok.
+ * [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o/tree/main): Thanks to BLIP3-o for providing the generation data and training code.
+ * [InternVL](https://github.com/OpenGVLab/InternVL): We use InternVL3 as the pretrained MLLM.
+ * [SANA](https://github.com/NVlabs/Sana): We use SANA as the pretrained DiT.
+ * [DC-AE](https://github.com/mit-han-lab/efficientvit): We use the pixel decoder from DC-AE.

- GitHub: https://github.com/nnnth/UniLIP
+ ## 📘 Citation
+ Please consider citing our work as follows if you find it helpful.
+ ```bibtex
+ @article{tang2025unilip,
+   title={UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
+   author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
+   journal={arXiv preprint arXiv:2507.23278},
+   year={2025}
+ }
+ ```