---
license: apache-2.0
language:
- en
- zh
pipeline_tag: any-to-any
library_name: transformers
---
## Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 💾 GitHub
## Key Features
- 🌐 **First Unified Autoregressive MLLM with Continuous Vision Tokens:** Ming-UniVision is the first multimodal large language model that natively integrates continuous visual representations from MingTok into a next-token prediction (NTP) framework—unifying vision and language under a single autoregressive paradigm without discrete quantization or modality-specific heads.
- ⚡ **3.5× Faster Convergence in Joint Vision-Language Training:** The coherent representational space between understanding and generation—enabled by MingTok—reduces optimization conflicts across tasks, leading to dramatically faster convergence during end-to-end multimodal pretraining.
- 🔄 **Multi-Round In-Context Vision Tasks:** Ming-UniVision supports iterative understanding, generation, and editing entirely within the continuous latent space, without decoding intermediate states into images, enabling efficient and coherent multimodal reasoning. Users can alternate between asking questions and requesting edits, just as in a conversation with a human.
**Figure 1: Conceptual comparison and qualitative examples of Ming-UniVision built upon MingTok.**
**Figure 2: Multi-Round image understanding, generation and editing architecture of Ming-UniVision, powered by MingTok.**
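To make the continuous-token idea concrete, below is a minimal, hypothetical PyTorch sketch (toy dimensions and invented module names; not the released implementation). It contrasts the codebook lookup that a discrete visual tokenizer requires with feeding MingTok-style continuous latents straight into one causal sequence alongside text embeddings.
```python
# Conceptual sketch only, not the released Ming-UniVision code. Shapes and module
# names are illustrative assumptions.
import torch
import torch.nn as nn

LATENT, D, VOCAB, CODEBOOK = 64, 256, 1000, 8192   # assumed toy sizes

features = torch.randn(1, 16, LATENT)              # 16 visual tokens from a MingTok-like encoder (toy values)

# Discrete baseline: vector quantization snaps each token to its nearest codebook entry (lossy).
codebook = torch.randn(CODEBOOK, LATENT)
ids = torch.cdist(features, codebook.unsqueeze(0)).argmin(dim=-1)
quantized = codebook[ids]                          # what a discrete-token unified model would feed the LM

# Continuous route: project the latents into the LM hidden space directly, no rounding to a codebook.
to_hidden = nn.Linear(LATENT, D)
text_embed = nn.Embedding(VOCAB, D)
prompt_ids = torch.randint(0, VOCAB, (1, 8))       # a toy 8-token text prompt

# One interleaved sequence of text embeddings and continuous visual tokens, processed
# causally in next-token-prediction style by a single transformer layer.
seq = torch.cat([text_embed(prompt_ids), to_hidden(features)], dim=1)   # [1, 24, D]
T = seq.size(1)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
hidden = layer(seq, src_mask=causal)
```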
## Usage
```python
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Single-round image generation: the generated image is saved under output_image_prefix.
image_gen_prompt = "Please generate the corresponding image based on the description. A cute girl."
messages = [{
    "role": "HUMAN",
    "content": [{"type": "text", "text": image_gen_prompt}],
}]
output_text = model.generate(messages, max_new_tokens=512, output_image_prefix="a_cute_girl")
model.reset_inner_state()

# Single-round image understanding: put "image" and "text" entries in the same message.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Please describe the picture in detail."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

# Multi-round editing: chain generate(..., for_edit=True) calls with unique output prefixes;
# the internal state carries the image across rounds.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_0")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Change the color of her cloth to red."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_1")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Refine the image for better clarity."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_2")
model.reset_inner_state()

# Single-round text-only conversation.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "请详细介绍鹦鹉的习性。"},  # "Please describe the habits of parrots in detail."
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()
```
📌 Tips:
- Image generation: use a descriptive prompt and set ``output_image_prefix`` to control where the output image is saved.
- Image understanding: include both "image" and "text" entries in the same message.
- Image editing: chain multiple ``generate(..., for_edit=True)`` calls, giving each a unique ``output_image_prefix``.
- Multi-turn interactions are supported via the model's internal state; call ``model.reset_inner_state()`` to start a fresh conversation (see the sketch after these tips).
- Input types: "text" and "image"; the order is flexible and mixed inputs are allowed.
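The snippet below is a minimal sketch of a mixed understanding-then-editing session that keeps the internal state across turns instead of resetting between tasks. It reuses only the ``MingUniVisionInfer`` API shown in the Usage block above; the prompts and file names are illustrative, and multi-turn behavior is subject to the limitations noted next.
```python
# Minimal multi-turn sketch reusing the API from the Usage block; prompts and
# file names are placeholders.
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Turn 1: ask about an existing image (kept in the model's internal state).
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "What is the dominant color of her clothes?"},
    ],
}]
print(model.generate(messages, max_new_tokens=128))

# Turn 2: edit the same image in context; a unique prefix names the saved result.
messages = [{
    "role": "HUMAN",
    "content": [{"type": "text", "text": "Change the color of her cloth to red."}],
}]
model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="chat_edit_round_0")

# Reset the internal state before starting an unrelated conversation.
model.reset_inner_state()
```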
📝 Note (Model Limitations):
- The current model was **trained with only two-turn conversations** and has not been optimized for alternating rounds of image understanding and generation, although it may generalize to more than two turns at inference time. As a result, performance may be limited in complex multimodal dialogue scenarios that require deep contextual reasoning across turns.
- This open-sourced version was **trained using mixed-resolution strategies**: high resolution for image understanding, but lower resolution for image editing and generation. Additionally, large-scale interleaved image-text data was not included during pretraining.
- Due to these factors, *image editing quality and consistency may be suboptimal* compared to fully end-to-end, high-resolution multimodal models. We are actively working on improved versions with unified resolution training and richer interleaved data.
## Performance
### Multimodal Understanding
Table 1. Quantitative evaluations on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.
| Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
|---|---|---|---|---|---|---|---|---|---|
| *Understanding Only* | | | | | | | | | |
| Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
| Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
| Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
| InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
| InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
| DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
| *Unified model, Separate representation* | | | | | | | | | |
| Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
| LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
| MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
| Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
| BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
| BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
| *Unified model, Unified representation* | | | | | | | | | |
| VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
| TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
| UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
| Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
| TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
| **Ming-UniVision-16B-A3B (Ours)** | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |
### Text-to-Image Generation
Table 2. Evaluation of text-to-image generation ability on GenEval and DPG-Bench. † denotes performance obtained with rewritten prompts.
| Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
|---|---|---|---|---|---|---|---|---|
| *Generation Only* | | | | | | | | |
| LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
| PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |
| DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 | - |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 |
| DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 83.50 |
| SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
| *Unified model, Separate representation* | | | | | | | | |
| Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 | - |
| Ming-Lite-Uni | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 | - |
| Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 | 82.63 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
| Show-o2-7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
| MetaQuery-L† | - | - | - | - | - | - | 0.78 | 81.10 |
| BLIP3-o 4B | - | - | - | - | - | - | 0.81 | 79.36 |
| BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
| *Unified model, Unified representation* | | | | | | | | |
| Harmon-1.5B | 0.99 | 0.86 | 0.66 | 0.85 | 0.74 | 0.48 | 0.79 | - |
| TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
| **Ming-UniVision-16B-A3B (Ours)** | 1.00 | 0.93 | 0.59 | 0.93 | 0.92 | 0.70 | 0.85 | 82.12 |
## Reference
```bibtex
@article{huang2025mingunivision,
title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
journal={arXiv preprint arXiv:2510.06590},
year={2025}
}
```