OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment

| GitHub | Paper | 🤗 HF Models | ModelScope |

(Figure: OmniBridge architecture.)

We propose OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, decoupling visual generation, multimodal retrieval, and latent-space alignment from the core LLM.
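
The latent alignment module is described here only at a high level. As a rough illustration of the general pattern (learnable query tokens that cross-attend to LLM hidden states, plus a reverse projection back into the LLM embedding space), the sketch below may help; all names, dimensions, and design details are assumptions, not OmniBridge's actual implementation.

```python
# Hypothetical sketch of a bidirectional latent alignment bridge.
# Nothing here is taken from the OmniBridge code base; it only
# illustrates the "learnable queries + cross-attention" pattern.
import torch
import torch.nn as nn

class LatentAlignmentBridge(nn.Module):
    def __init__(self, llm_dim=4096, latent_dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries that pull information out of the frozen LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.to_latent = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, latent_dim)   # LLM -> visual/retrieval latent
        self.in_proj = nn.Linear(latent_dim, llm_dim)    # visual latent -> LLM space

    def llm_to_latent(self, hidden_states):
        # hidden_states: (batch, seq_len, llm_dim) from the pretrained LLM.
        q = self.queries.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        attended, _ = self.to_latent(q, hidden_states, hidden_states)
        return self.out_proj(attended)   # (batch, num_queries, latent_dim)

    def latent_to_llm(self, latents):
        # Project visual latents back into the LLM embedding space,
        # so they can be consumed as soft prompt tokens.
        return self.in_proj(latents)
```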

(Figure: OmniBridge excels in both generation and perception.)

We demonstrate the effectiveness of our framework through extensive experiments on standard vision-language benchmarks, validating that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.


Highlights

  • OmniBridge is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
  • OmniBridge introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
  • OmniBridge designs a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent-space alignment (see the sketch after this list).
  • Extensive experiments on standard vision-language benchmarks validate that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
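
As noted in the third highlight, the semantic-guided diffusion mechanism gradually swaps text conditioning for learnable query embeddings. A minimal sketch of what such a schedule could look like is below; the linear mixing rule, names, and shapes are assumptions for illustration, not the paper's exact mechanism.

```python
# Hypothetical sketch of gradually replacing text conditioning with
# learnable query embeddings during diffusion training. The linear
# schedule and tensor shapes are illustrative assumptions only.
import torch

def mixed_condition(text_emb: torch.Tensor, query_emb: torch.Tensor,
                    step: int, total_steps: int) -> torch.Tensor:
    """Interpolate the diffusion condition from frozen text embeddings
    toward learnable query embeddings as training progresses.

    text_emb:  (batch, num_tokens, dim) output of a frozen text encoder
    query_emb: (batch, num_tokens, dim) learnable queries from the LLM bridge
    """
    alpha = min(step / total_steps, 1.0)  # 0.0 -> pure text, 1.0 -> pure queries
    return (1.0 - alpha) * text_emb + alpha * query_emb

# Inside a training loop, the mixed condition would feed the diffusion
# backbone, e.g.:
#   cond = mixed_condition(text_emb, query_emb, global_step, replace_steps)
#   loss = diffusion_loss(backbone(noisy_latents, timesteps, cond), noise)
```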

Performance

Vision-Language Understanding

Multimodal Reasoning and Mathematics


OCR, Chart, and Document Understanding


Multi-Image Understanding


Real-World Comprehension


Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation


Multimodal Understanding Cases


Image Generation

Performance on the GenEval benchmark


Performance on DPG-Bench


Image Generation Cases


Image Editing

Performance on ImgEdit-Bench


Image Editing Cases


Multimodal Retrieval


News

  • 2025.09 We release OmniBridge, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
  • 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture.

TODO

  • Release model weights of OmniBridge.

Setup

Clone this repository and install the required packages:

git clone https://github.com/xiao-xt/OmniBridge
cd OmniBridge
pip install -r requirements.txt

For image generation, you also need to download the decoder weights of HunyuanDiT: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
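
For example, the weights can be fetched with huggingface_hub (the local_dir below is just an example path):

```python
# Download the HunyuanDiT v1.2 weights used as the image-generation decoder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
    local_dir="./ckpts/HunyuanDiT-v1.2",  # example path; adjust to your setup
)
```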

Model Weights

| Model name | HF Weight | ModelScope |
| --- | --- | --- |
| OmniBridge | 🤗 HF link | ModelScope link |
| OmniBridge-Retrieval-Finetuned | 🤗 HF link | ModelScope link |

Quickstart

Use 🤗Transformers to run OmniBridge for vision-language understanding

python ./multimodal_understanding.py
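
If you prefer to drive the model from 🤗 Transformers directly rather than through the script, a sketch along these lines should work. The checkpoint path, prompt format, and processor behavior are assumptions; consult multimodal_understanding.py for the exact API.

```python
# Hypothetical minimal sketch; multimodal_understanding.py is the
# authoritative entry point. Checkpoint path and prompt format are assumed.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "path/to/OmniBridge"  # local checkpoint or HF repo id
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

image = Image.open("example.jpg")
inputs = processor(text="Describe this image.", images=image,
                   return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```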

Use 🤗Transformers to run OmniBridge for image generation

python ./image_generation.py

Use 🤗Transformers to run OmniBridge for image editing

python ./image_editing.py

Use 🤗Transformers to run OmniBridge for multimodal retrieval

python ./multimodal_retrieval.py
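
For retrieval, the usual pattern is to embed both modalities into the shared latent space and rank candidates by cosine similarity. The sketch below shows that pattern only; encode_text and encode_image are stand-in names, and multimodal_retrieval.py is the authoritative reference for OmniBridge's actual API.

```python
# Hypothetical retrieval sketch: embed text and images, rank by cosine
# similarity. encode_text / encode_image are stand-in method names.
import torch
import torch.nn.functional as F

def rank_images(model, processor, query_text, images):
    # Embed the text query and each candidate image into the shared space.
    text_emb = model.encode_text(processor(text=query_text, return_tensors="pt"))
    image_embs = torch.cat(
        [model.encode_image(processor(images=img, return_tensors="pt"))
         for img in images]
    )
    sims = F.cosine_similarity(text_emb, image_embs)  # (num_images,)
    return sims.argsort(descending=True)              # indices, best match first
```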

Citation

If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:

@article{xiao2025omnibridge,
  title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
  author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
  journal={arXiv preprint arXiv:2509.19018},
  year={2025}
}