OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
| Github | Paper | 🤗HF Models | Modelscope |
We propose OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, decoupling visual generation, multimodal retrieval, and latent space alignment from the core LLM.
OmniBridge excels in both generation and perception
We demonstrate the effectiveness of the framework through extensive experiments on standard vision-language benchmarks, showing that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
Highlights
- OmniBridge is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
- OmniBridge introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
- OmniBridge features a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
- Extensive experiments on standard vision-language benchmarks validate that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
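The semantic-guided diffusion training mechanism in the highlights above can be sketched as a gradual interpolation between text conditioning and learnable query embeddings. This is an illustrative sketch only, not the actual implementation; the schedule shape and all names (`alpha_schedule`, `mixed_condition`) are our own assumptions:

```python
def alpha_schedule(step: int, total_steps: int) -> float:
    """Fraction of conditioning drawn from learnable queries, annealed 0 -> 1.
    A linear schedule is assumed here purely for illustration."""
    return min(1.0, step / total_steps)

def mixed_condition(text_emb, query_emb, step: int, total_steps: int):
    """Blend text-encoder embeddings with learnable query embeddings.

    Early in training the diffusion decoder is conditioned on text; as
    alpha grows, the learnable queries progressively take over, so the
    latent space is aligned in a fine-grained, controllable way.
    """
    a = alpha_schedule(step, total_steps)
    return [(1.0 - a) * t + a * q for t, q in zip(text_emb, query_emb)]

# Toy 2-dim "embeddings" as stand-ins for real encoder outputs.
text = [1.0, 1.0]
query = [0.0, 0.0]

cond_start = mixed_condition(text, query, step=0, total_steps=100)    # pure text
cond_end = mixed_condition(text, query, step=100, total_steps=100)    # pure queries
```

In the real mechanism the embeddings are high-dimensional tensors and the schedule is tied to the training procedure; the point here is only the gradual replacement of one conditioning signal by another.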
Performance
Vision-Language Understanding
Multimodal Reasoning and Mathematics
OCR, Chart, and Document Understanding
Multi-Image Understanding
Real-World Comprehension
Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation
Multimodal Understanding Cases
Image Generation
Performance on GenEval benchmark
Performance on DPG-Bench
Image Generation Cases
Image Editing
Performance on ImgEdit-Bench
Image Editing Cases
Multimodal Retrieval
News
- 2025.09 We release OmniBridge, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture.
TODO
- Release model weights of OmniBridge.
Setup
Clone this repository and install required packages:
```shell
git clone https://github.com/xiao-xt/OmniBridge
cd OmniBridge
pip install -r requirements.txt
```
You also need to download the HunyuanDiT decoder weights used for image generation: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
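As one way to fetch these weights, you can use `huggingface_hub` (the repo id comes from the link above; the local directory name is our choice, and the released scripts may expect a different layout):

```python
def download_hunyuan_decoder(local_dir: str = "./HunyuanDiT-v1.2") -> str:
    """Download the HunyuanDiT v1.2 weights used as the image decoder.

    Requires `pip install huggingface_hub`; the import is deferred so this
    module can be loaded without the package installed.
    """
    from huggingface_hub import snapshot_download

    # Downloads the full snapshot of the repo into `local_dir` and
    # returns the path to the downloaded files.
    return snapshot_download(
        repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
        local_dir=local_dir,
    )
```

Call `download_hunyuan_decoder()` once before running image generation; the download is large, so make sure you have sufficient disk space.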
Model Weights
| Model name | HF Weight | Modelscope |
|---|---|---|
| OmniBridge | 🤗 HF link | Modelscope link |
| OmniBridge-Retrieval-Finetuned | 🤗 HF link | Modelscope link |
Quickstart
Use 🤗Transformers to run OmniBridge for vision-language understanding:
```shell
python ./multimodal_understanding.py
```
Use 🤗Transformers to run OmniBridge for image generation:
```shell
python ./image_generation.py
```
Use 🤗Transformers to run OmniBridge for image editing:
```shell
python ./image_editing.py
```
Use 🤗Transformers to run OmniBridge for multimodal retrieval:
```shell
python ./multimodal_retrieval.py
```
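Under the hood, scripts like these typically load the model through 🤗Transformers. The sketch below shows what such loading generally looks like; the model id is a placeholder and the exact model/processor classes are assumptions, so consult the released scripts and the model cards above for the real interface:

```python
def load_omnibridge(model_id: str = "path/to/OmniBridge"):
    """Load an OmniBridge checkpoint with the generic Transformers Auto classes.

    `model_id` is a placeholder: substitute the HF or ModelScope weight path
    from the Model Weights table. `trust_remote_code=True` is commonly needed
    for models that ship custom architecture code with their checkpoints.
    The import is deferred so this sketch can be read without transformers
    installed.
    """
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor
```

This is only a starting point; the per-task scripts above remain the authoritative usage examples.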
Citation
If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:
```bibtex
@article{xiao2025omnibridge,
  title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
  author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
  journal={arXiv preprint arXiv:2509.19018},
  year={2025}
}
```