Improve model card: Add pipeline tag, library, abstract, and comprehensive usage
This PR significantly enhances the model card for UniLIP-3B by:
- Adding the `pipeline_tag: any-to-any`, accurately reflecting the model's unified multimodal capabilities in understanding, generation, and editing.
- Specifying `library_name: transformers`, as the model is compatible with the 🤗 Transformers library (evidenced by the `transformers_version` field in `config.json`). This enables automated code snippets for easy usage on the Hub (see the illustrative snippet after this list).
- Updating the paper link to point directly to the Hugging Face papers page: [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278).
- Including the full paper abstract, providing essential context about the model's design and contributions.
- Adding detailed "Main Results" tables to showcase the model's performance across various tasks.
- Integrating a comprehensive "Quick Start" section, including installation instructions and "Simple Inference" code snippets for image generation, understanding, and omni-potent tasks, directly from the official GitHub repository. The `model_path` in the code has been updated to refer to `kanashi6/UniLIP-3B` for direct usability.
- Including "Acknowledgements" and "Citation" sections for proper attribution.
These changes will make the model more discoverable, informative, and user-friendly on the Hugging Face Hub.
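
For reference, the automated snippet that `library_name: transformers` unlocks on the Hub typically looks like the sketch below. Whether a plain `AutoModel` load fully reconstructs this checkpoint is an assumption (it likely needs `trust_remote_code=True` and the custom code shipped with the repository); the recommended inference path remains the `FlexARInferenceSolver` shown in the model card's Quick Start.

```python
# Illustrative sketch only: the kind of snippet the Hub auto-generates for
# library_name: transformers. Loading UniLIP this way is an assumption; the
# official usage goes through FlexARInferenceSolver (see the Quick Start below).
from transformers import AutoConfig, AutoModel

model_id = "kanashi6/UniLIP-3B"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
print(type(model).__name__)
```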

The updated `README.md`:

---
base_model:
- OpenGVLab/InternVL3-2B
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
- BLIP3o/BLIP3o-Pretrain-Short-Caption
- UCSC-VLAA/GPT-Image-Edit-1.5M
- BLIP3o/BLIP3o-60k
- FreedomIntelligence/ShareGPT-4o-Image
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

This repository contains the 3B version of the model presented in the paper [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278).

UniLIP proposes a unified, CLIP-based encoder featuring both rich semantics and fine-grained image details. Through a **two-stage, self-distillation training scheme** for reconstruction, we empower CLIP to achieve excellent reconstruction results **without compromising its original understanding abilities**. Leveraging this powerful unified representation, UniLIP excels across understanding, generation, and editing tasks.
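
As a rough illustration of how a reconstruction objective can be paired with self-distillation so that the adapted CLIP keeps its original features, here is a minimal sketch. It is an assumption for exposition only: the function names, the frozen-teacher setup, and the loss weighting are not taken from UniLIP's released training code.

```python
import torch
import torch.nn.functional as F

def recon_self_distill_step(student_clip, teacher_clip, pixel_decoder, images):
    """Hypothetical single training step: reconstruct pixels from CLIP features
    while keeping those features close to a frozen copy of the original CLIP."""
    feats = student_clip(images)                     # trainable CLIP features
    with torch.no_grad():
        teacher_feats = teacher_clip(images)         # frozen original CLIP (teacher)
    recon = pixel_decoder(feats)                     # decode features back to pixels
    loss_recon = F.mse_loss(recon, images)           # reconstruction term
    loss_distill = F.mse_loss(feats, teacher_feats)  # self-distillation term
    return loss_recon + loss_distill
```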

For more details, please refer to the original paper and the GitHub repository:

Paper: [UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing](https://huggingface.co/papers/2507.23278)

GitHub: https://github.com/nnnth/UniLIP

<div align="center">
<img src="https://github.com/nnnth/UniLIP/raw/main/assets/img1.png" width="800"/>
</div>

## Abstract

In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks the reconstruction abilities required of a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and UniWorld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features not only as the optimal choice for understanding tasks but also as highly competitive in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.
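
To make the dual-condition design described above a bit more concrete, here is a minimal, hypothetical PyTorch sketch of a connector that conditions the diffusion transformer on both the MLLM's multimodal hidden states and the hidden states of learnable query embeddings. Dimensions, names, and the Transformers-style `mllm` call are assumptions; UniLIP's actual implementation may differ.

```python
import torch
import torch.nn as nn

class DualConditionConnector(nn.Module):
    """Illustrative sketch of the dual-condition idea: the DiT is conditioned
    jointly on (a) MLLM hidden states over the multimodal context and (b) the
    hidden states of learnable query embeddings appended to the sequence."""
    def __init__(self, num_queries=64, mllm_dim=2048, dit_dim=1152):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        self.proj_ctx = nn.Linear(mllm_dim, dit_dim)    # projects context hidden states
        self.proj_query = nn.Linear(mllm_dim, dit_dim)  # projects query hidden states

    def forward(self, mllm, context_embeds):
        # Append learnable queries to the multimodal context and run the MLLM once
        # (assumes a Transformers-style causal LM that accepts inputs_embeds).
        b, n_ctx = context_embeds.size(0), context_embeds.size(1)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        hidden = mllm(
            inputs_embeds=torch.cat([context_embeds, queries], dim=1),
            output_hidden_states=True,
        ).hidden_states[-1]
        ctx_h, query_h = hidden[:, :n_ctx], hidden[:, n_ctx:]
        # Both streams are projected and concatenated as the DiT condition.
        return torch.cat([self.proj_ctx(ctx_h), self.proj_query(query_h)], dim=1)
```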

## Main Results

### Image Reconstruction on ImageNet val

| Model | Res. | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| VILA-U | 256 | 16 | 1.80 | - | - |
| TokenFlow | 256 | 16 | 1.37 | 21.41 | 0.687 |
| DualViTok | 256 | 16 | 1.37 | 22.53 | 0.741 |
| **UniLIP** | 256 | 32 | **0.79** | **22.99** | **0.747** |
| Emu2 | 448 | 14 | 3.27 | 13.49 | 0.423 |
| **UniLIP** | 448 | 32 | **0.31** | **24.62** | **0.788** |

### Image Understanding

| Model | # LLM Params | MME-P | MMB | MMMU | MM-Vet | SEED | AI2D | MMVP |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 69.4 | 67.3 |
| InternVL3-2B | 1.8B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 78.5 | 72.7 |
| BAGEL-3B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | - | 54.7 |
| BLIP3o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - | - |
| TokLIP-7B | 7B | 1410 | - | 42.1 | - | 65.2 | - | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - | - |
| **UniLIP-1B** | 1B | 1499 | 72.6 | 43.3 | 59.4 | 71.0 | 70.7 | 68.7 |
| **UniLIP-3B** | 2B | **1636** | **80.7** | **48.7** | **62.2** | **75.0** | **78.6** | **73.0** |

### Image Generation and Editing

| Model | # Params | GenEval | WISE | ImgEdit |
| :--- | :--- | :--- | :--- | :--- |
| BAGEL | 7B+7B | 0.82 | 0.52 | 3.20 |
| BLIP3o-4B | 3B+1.4B | 0.81 | 0.50 | - |
| UniWorld-V1 | 7B+12B | - | - | 3.26 |
| **UniLIP-1B** | 1B+0.6B | 0.88 | 0.56 | 3.81 |
| **UniLIP-3B** | 2B+1.6B | **0.90** | **0.63** | **3.94** |

## 🛠️ Quick Start

### Installation

```shell
conda create -n UniLIP python=3.11
conda activate UniLIP
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e .
```

### Simple Inference

The simplest code for UniLIP inference, covering generation, understanding, and omni-potent tasks:

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
print("--- Image Generation Example ---")
inference_solver = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
    precision="bf16",
    target_size=768,
)

q1 = "Generate an image of 768x768 according to the following prompt:\n" \
     "Image of a dog playing water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]
print(f"Generated text: {a1}")
# new_image.save("generated_dog_image.png")  # Uncomment to save the generated image


# ******************* Image Understanding ******************
print("\n--- Image Understanding Example ---")
inference_solver_understanding = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
    precision="bf16",
    target_size=512,
)

# The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM
q2 = "Describe the image in detail. <|image|>"

# Example: images = [Image.open("path/to/your/image.png")]
images_for_understanding = [Image.new('RGB', (512, 512), color='white')]  # Placeholder if no image file is available
qas_for_understanding = [[q2, None]]

# `len(images)` should be equal to the number of appearances of "<|image|>" in qas
generated_understanding = inference_solver_understanding.generate(
    images=images_for_understanding,
    qas=qas_for_understanding,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver_understanding.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a2 = generated_understanding[0]
print(f"Generated description: {a2}")
# generated_understanding[1], namely the list of newly generated images, should typically be empty in this case.


# ********************* Omni-Potent Example (Depth Estimation) *********************
print("\n--- Omni-Potent Example (Depth Estimation) ---")
inference_solver_omni = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",  # Using the model from this repository
    precision="bf16",
    target_size=768,
)

# Example: Depth Estimation
# For more instructions, see demos/demo_image2image.py
q3 = "Depth estimation. <|image|>"
# Example: images = [Image.open("path/to/your/image.png")]
images_for_omni = [Image.new('RGB', (768, 768), color='gray')]  # Placeholder if no image file is available
qas_for_omni = [[q3, None]]

generated_omni = inference_solver_omni.generate(
    images=images_for_omni,
    qas=qas_for_omni,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver_omni.create_logits_processor(cfg=1.0, image_top_k=200),
)

a3 = generated_omni[0]
new_image_omni = generated_omni[1][0]
print(f"Generated text (depth estimation): {a3}")
# new_image_omni.save("generated_depth_map.png")  # Uncomment to save the generated depth map
```
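
Editing is one of the model's headline capabilities but is not covered by the snippet above. The sketch below reuses the same `FlexARInferenceSolver` interface for an editing-style prompt; the exact instruction format and sampling settings here are assumptions, so please refer to the demo scripts in the GitHub repository for the official editing pipeline.

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ********************* Image Editing (illustrative sketch) *********************
# Assumption: editing uses the same interface as the examples above; the prompt
# format and cfg/top-k values are placeholders rather than official settings.
inference_solver_edit = FlexARInferenceSolver(
    model_path="kanashi6/UniLIP-3B",
    precision="bf16",
    target_size=768,
)

q_edit = "Replace the waterfall with a snowy mountain. <|image|>"
images_for_edit = [Image.open("path/to/your/image.png")]  # the source image to edit

generated_edit = inference_solver_edit.generate(
    images=images_for_edit,
    qas=[[q_edit, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver_edit.create_logits_processor(cfg=4.0, image_top_k=2000),
)

edited_image = generated_edit[1][0]
# edited_image.save("edited_image.png")  # Uncomment to save the edited result
```
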
## 👍 Acknowledgement
* [TiTok](https://github.com/bytedance/1d-tokenizer): We implement reconstruction training following TiTok.
* [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o/tree/main): Thanks to BLIP3-o for providing the generation data and training code.
* [InternVL](https://github.com/OpenGVLab/InternVL): We use InternVL3 as the pretrained MLLM.
* [SANA](https://github.com/NVlabs/Sana): We use SANA as the pretrained DiT.
* [DC-AE](https://github.com/mit-han-lab/efficientvit): We use the pixel decoder from DC-AE.

## 📘 Citation
Please consider citing our work as follows if it is helpful.
```bibtex
@article{tang2025unilip,
  title={UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing},
  author={Tang, Hao and Xie, Chenwei and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
  journal={arXiv preprint arXiv:2507.23278},
  year={2025}
}
```