OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
| Github | Paper | 🤗HF Models | Modelscope |
We propose OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, decoupling visual generation, multimodal retrieval, and latent space alignment from the core LLM.
OmniBridge excels in both generation and perception
We demonstrate the effectiveness of the framework through extensive experiments on standard vision-language benchmarks, showing that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
Highlights
- OmniBridge is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
- OmniBridge introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
- OmniBridge features a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
- Extensive experiments on standard vision-language benchmarks validate that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
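The semantic-guided diffusion training mechanism in the highlights above can be sketched as a gradual interpolation between text conditioning and learnable query embeddings. This is an illustrative sketch only, not the actual implementation; the schedule shape and all names (`alpha_schedule`, `mixed_condition`) are our own assumptions:

```python
def alpha_schedule(step: int, total_steps: int) -> float:
    """Fraction of conditioning drawn from learnable queries, annealed 0 -> 1.
    A linear schedule is assumed here purely for illustration."""
    return min(1.0, step / total_steps)

def mixed_condition(text_emb, query_emb, step: int, total_steps: int):
    """Blend text-encoder embeddings with learnable query embeddings.

    Early in training the diffusion decoder is conditioned on text; as
    alpha grows, the learnable queries progressively take over, so the
    latent space is aligned in a fine-grained, controllable way.
    """
    a = alpha_schedule(step, total_steps)
    return [(1.0 - a) * t + a * q for t, q in zip(text_emb, query_emb)]

# Toy 2-dim "embeddings" as stand-ins for real encoder outputs.
text = [1.0, 1.0]
query = [0.0, 0.0]

cond_start = mixed_condition(text, query, step=0, total_steps=100)    # pure text
cond_end = mixed_condition(text, query, step=100, total_steps=100)    # pure queries
```

In the real mechanism the embeddings are high-dimensional tensors and the schedule is tied to the training procedure; the point here is only the gradual replacement of one conditioning signal by another.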
Performance
Vision-Language Understanding
Multimodal Reasoning and Mathematics
OCR, Chart, and Document Understanding
Multi-Image Understanding
Real-World Comprehension
Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation
Multimodal Understanding Cases
Image Generation
Performance on GenEval benchmark
Performance on DPG-Bench
Image Generation Cases
Image Editing
Performance on ImgEdit-Bench
Image Editing Cases
Multimodal Retrieval
News
- 2025.09 We release OmniBridge, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture.
TODO
- Release model weights of OmniBridge.
Setup
Clone this repository and install required packages:
```shell
git clone https://github.com/xiao-xt/OmniBridge
cd OmniBridge
pip install -r requirements.txt
```
You also need to download the HunyuanDiT decoder weights used for image generation: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
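As one way to fetch these weights, you can use `huggingface_hub` (the repo id comes from the link above; the local directory name is our choice, and the released scripts may expect a different layout):

```python
def download_hunyuan_decoder(local_dir: str = "./HunyuanDiT-v1.2") -> str:
    """Download the HunyuanDiT v1.2 weights used as the image decoder.

    Requires `pip install huggingface_hub`; the import is deferred so this
    module can be loaded without the package installed.
    """
    from huggingface_hub import snapshot_download

    # Downloads the full snapshot of the repo into `local_dir` and
    # returns the path to the downloaded files.
    return snapshot_download(
        repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
        local_dir=local_dir,
    )
```

Call `download_hunyuan_decoder()` once before running image generation; the download is large, so make sure you have sufficient disk space.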
Model Weights
| Model name | HF Weight | Modelscope |
|---|---|---|
| OmniBridge | 🤗 HF link | Modelscope link |
| OmniBridge-Retrieval-Finetuned | 🤗 HF link | Modelscope link |
Quickstart
Use 🤗Transformers to run OmniBridge for vision-language understanding:
```shell
python ./multimodal_understanding.py
```
Use 🤗Transformers to run OmniBridge for image generation:
```shell
python ./image_generation.py
```
Use 🤗Transformers to run OmniBridge for image editing:
```shell
python ./image_editing.py
```
Use 🤗Transformers to run OmniBridge for multimodal retrieval:
```shell
python ./multimodal_retrieval.py
```
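Under the hood, scripts like these typically load the model through 🤗Transformers. The sketch below shows what such loading generally looks like; the model id is a placeholder and the exact model/processor classes are assumptions, so consult the released scripts and the model cards above for the real interface:

```python
def load_omnibridge(model_id: str = "path/to/OmniBridge"):
    """Load an OmniBridge checkpoint with the generic Transformers Auto classes.

    `model_id` is a placeholder: substitute the HF or ModelScope weight path
    from the Model Weights table. `trust_remote_code=True` is commonly needed
    for models that ship custom architecture code with their checkpoints.
    The import is deferred so this sketch can be read without transformers
    installed.
    """
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor
```

This is only a starting point; the per-task scripts above remain the authoritative usage examples.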
Citation
If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:
```bibtex
@article{xiao2025omnibridge,
  title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
  author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
  journal={arXiv preprint arXiv:2509.19018},
  year={2025}
}
```