---
license: apache-2.0
language:
- en
- zh
pipeline_tag: any-to-any
library_name: transformers
---
## Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
<p align="center">📑 <a href="https://arxiv.org/pdf/2510.06590">Technical Report</a> | 📖 <a href="https://inclusionai.github.io/blog/mingtok/">Project Page</a> | 🤗 <a href="https://huggingface.co/inclusionAI/Ming-UniVision-16B-A3B">Hugging Face</a> | 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-UniVision-16B-A3B">ModelScope</a> | 💾 <a href="https://github.com/inclusionAI/Ming-UniVision">GitHub</a></p>
## Key Features
- 🌐 **First Unified Autoregressive MLLM with Continuous Vision Tokens:** Ming-UniVision is the first multimodal large language model that natively integrates continuous visual representations from MingTok into a next-token prediction (NTP) framework, unifying vision and language under a single autoregressive paradigm without discrete quantization or modality-specific heads (a minimal conceptual sketch follows this list).
- ⚡ **3.5× Faster Convergence in Joint Vision-Language Training:** The coherent representational space shared by understanding and generation, enabled by MingTok, reduces optimization conflicts across tasks, leading to dramatically faster convergence during end-to-end multimodal pretraining.
- 🔄 **Multi-Round In-Context Vision Tasks:** Ming-UniVision supports iterative understanding, generation, and editing entirely within the continuous latent space, without decoding intermediate states into images, enabling efficient and coherent multimodal reasoning. Users can alternate between asking questions and requesting edits, just like conversing with a human.
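To make the first point concrete, here is a minimal, self-contained sketch (illustrative only, not Ming-UniVision's actual implementation; every module name and dimension below is hypothetical) of next-token prediction over a single mixed sequence, where discrete text tokens and continuous vision tokens share one causal backbone:

```python
# Conceptual sketch of unified next-token prediction over mixed discrete/continuous tokens.
# All names and sizes are hypothetical; this is NOT the Ming-UniVision code.
import torch
import torch.nn as nn

hidden, text_vocab, vision_dim = 256, 1000, 32

text_embed = nn.Embedding(text_vocab, hidden)   # discrete text tokens -> hidden states
vision_proj = nn.Linear(vision_dim, hidden)     # continuous vision tokens -> hidden states
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
text_head = nn.Linear(hidden, text_vocab)    # classification head for a next text token
vision_head = nn.Linear(hidden, vision_dim)  # regression head for a next continuous vision token

# One interleaved sequence: a short text prompt followed by continuous vision tokens.
prompt_ids = torch.randint(0, text_vocab, (1, 8))
vision_tokens = torch.randn(1, 16, vision_dim)
seq = torch.cat([text_embed(prompt_ids), vision_proj(vision_tokens)], dim=1)

# Causal attention so every position only attends to earlier text/vision tokens.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
h = backbone(seq, mask=causal_mask)

last_state = h[:, -1]                         # hidden state at the last position of the mixed sequence
next_text_logits = text_head(last_state)      # used when the next target is a text token
next_vision_token = vision_head(last_state)   # used when the next target is a continuous vision token
```

The point of the sketch is only that both modalities live in one autoregressive stream: vision positions are predicted as continuous, MingTok-style features by regression rather than as indices into a discrete codebook.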
<div align="center">
<img src="assets/0830-mingtok-fig1.jpg" alt="Conceptual comparison and qualitative examples" width="80%"/>
</div>
**Figure 1: Conceptual comparison and qualitative examples of Ming-UniVision built upon MingTok.**
<div align="center">
<img src="assets/ming-univision.jpg" alt="Model Architecture" width="80%"/>
</div>
**Figure 2: Multi-Round image understanding, generation and editing architecture of Ming-UniVision, powered by MingTok.**
## Usage
```python
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Single-round image generation
image_gen_prompt = "Please generate the corresponding image based on the description. A cute girl."
messages = [{
    "role": "HUMAN",
    "content": [{"type": "text", "text": image_gen_prompt}],
}]
output_text = model.generate(messages, max_new_tokens=512, output_image_prefix="a_cute_girl")
model.reset_inner_state()

# Single-round image understanding
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Please describe the picture in detail."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

# Multi-round image editing
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_0")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Change the color of her cloth to red."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_1")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Refine the image for better clarity."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_2")
model.reset_inner_state()

# Single-round text-only conversation
messages = [{
    "role": "HUMAN",
    "content": [
        # Prompt (Chinese): "Please describe the habits of parrots in detail."
        {"type": "text", "text": "请详细介绍鹦鹉的习性。"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()
```
📌 Tips:
- Image generation: Use descriptive prompts + ``output_image_prefix`` to save output.
- Image understanding: Include "image" and "text" in the same message.
- Image editing: Chain multiple ``generate(..., for_edit=True)`` calls with unique ``output_image_prefix`` names.
- Multi-turn interactions are supported via internal state; call ``model.reset_inner_state()`` to start a new session (a combined example follows these tips).
- Input types: "text" and "image" — flexible order, mixed inputs allowed.
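The tips above can be chained in one session. The sketch below follows the API from the usage block; the prompt text and file names are illustrative. It generates an image, asks a question about the result in the same session, and then resets the internal state.

```python
# Combined sketch: generation followed by understanding in one session.
# Prompt text and file names are illustrative.
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Turn 1: text-to-image generation, saved under the given prefix.
gen_messages = [{
    "role": "HUMAN",
    "content": [{"type": "text", "text": "Please generate the corresponding image based on the description. A red bicycle leaning against a brick wall."}],
}]
model.generate(gen_messages, max_new_tokens=512, output_image_prefix="red_bicycle")

# Turn 2: understanding of the image produced in turn 1 (mixed image + text input).
qa_messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "red_bicycle.png"},
        {"type": "text", "text": "What color is the bicycle, and where is it placed?"},
    ],
}]
print(model.generate(qa_messages, max_new_tokens=512))

# Clear the multi-turn state before starting an unrelated conversation.
model.reset_inner_state()
```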
📝 Note (Model Limitations):
- The current model was **trained with only two-turn conversations**, and has not been optimized for alternating rounds of image understanding and generation, although it may generalize to more than two turns during inference. As a result, performance may be limited in complex, multi-modal dialogue scenarios requiring deep contextual reasoning across turns.
- This open-sourced version was **trained using mixed-resolution strategies**: high resolution for image understanding, but lower resolution for image editing and generation. Additionally, large-scale interleaved image-text data was not included during pretraining.
- Due to these factors, *image editing quality and consistency may be suboptimal* compared to fully end-to-end, high-resolution multimodal models. We are actively working on improved versions with unified resolution training and richer interleaved data.
## Performance
### Multimodal Understanding
<table>
<caption>
<strong>Table 1. Quantitative evaluations</strong> on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.
</caption>
<thead>
<tr>
<th>Model</th>
<th>MMB&nbsp;<span class="uparrow">↑</span></th>
<th>MMS&nbsp;<span class="uparrow">↑</span></th>
<th>MMMU&nbsp;<span class="uparrow">↑</span></th>
<th>MathV&nbsp;<span class="uparrow">↑</span></th>
<th>Hall&nbsp;<span class="uparrow">↑</span></th>
<th>AI2D&nbsp;<span class="uparrow">↑</span></th>
<th>MM-Vet&nbsp;<span class="uparrow">↑</span></th>
<th>OCRBench&nbsp;<span class="uparrow">↑</span></th>
<th>MME&nbsp;<span class="uparrow">↑</span></th>
</tr>
</thead>
<tbody>
<!-- Understanding Only -->
<tr class="section-row">
<td colspan="10"><em>Understanding Only</em></td>
</tr>
<tr>
<td>Emu3-Chat</td>
<td>58.5</td>
<td>-</td>
<td>31.6</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>37.2</td>
<td>687</td>
<td>-</td>
</tr>
<tr>
<td>Qwen2.5-VL-3B</td>
<td>79.1</td>
<td>55.9</td>
<td>53.1</td>
<td>62.3</td>
<td>46.3</td>
<td>81.6</td>
<td>-</td>
<td>797</td>
<td>2157</td>
</tr>
<tr>
<td>Qwen2.5-VL-7B</td>
<td>83.5</td>
<td>63.9</td>
<td>58.6</td>
<td>68.2</td>
<td>52.9</td>
<td>83.9</td>
<td>67.1</td>
<td>864</td>
<td>2347</td>
</tr>
<tr>
<td>InternVL2.5-4B</td>
<td>81.1</td>
<td>58.3</td>
<td>52.3</td>
<td>60.5</td>
<td>46.3</td>
<td>81.4</td>
<td>60.6</td>
<td>828</td>
<td>2338</td>
</tr>
<tr>
<td>InternVL2.5-8B</td>
<td>84.6</td>
<td>62.8</td>
<td>56.0</td>
<td>64.4</td>
<td>50.1</td>
<td>84.5</td>
<td>62.8</td>
<td>822</td>
<td>2344</td>
</tr>
<tr>
<td>DeepSeek-VL2</td>
<td>79.6</td>
<td>61.3</td>
<td>51.1</td>
<td>62.8</td>
<td>-</td>
<td>81.4</td>
<td>-</td>
<td>811</td>
<td>2253</td>
</tr>
<!-- Unified model, Separate representation -->
<tr class="section-row">
<td colspan="10"><em>Unified model, Separate representation</em></td>
</tr>
<tr>
<td>Janus-Pro-7B</td>
<td>79.2</td>
<td>-</td>
<td>41.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>50.0</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>LMFusion</td>
<td>-</td>
<td>-</td>
<td>41.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>1603</td>
</tr>
<tr>
<td>MetaQuery-L</td>
<td>78.6</td>
<td>-</td>
<td>53.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>63.2</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Show-o2-7B</td>
<td>79.3</td>
<td>56.6</td>
<td>48.9</td>
<td>-</td>
<td>-</td>
<td>78.6</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>BLIP3-o 4B</td>
<td>78.6</td>
<td>-</td>
<td>46.6</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>60.1</td>
<td>-</td>
<td>2161</td>
</tr>
<tr>
<td>BAGEL</td>
<td>85.0</td>
<td>-</td>
<td>55.3</td>
<td>73.1</td>
<td>-</td>
<td>-</td>
<td>67.2</td>
<td>-</td>
<td>2388</td>
</tr>
<!-- Unified model, Unified representation -->
<tr class="section-row">
<td colspan="10"><em>Unified model, Unified representation</em></td>
</tr>
<tr>
<td>VILA-U</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>33.5</td>
<td>-</td>
<td>1402</td>
</tr>
<tr>
<td>TokenFlow-XL</td>
<td>76.8</td>
<td>-</td>
<td>43.2</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>48.2</td>
<td>-</td>
<td>1922</td>
</tr>
<tr>
<td>UniTok</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>33.9</td>
<td>-</td>
<td>1448</td>
</tr>
<tr>
<td>Harmon-1.5B</td>
<td>65.5</td>
<td>-</td>
<td>38.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>1476</td>
</tr>
<tr>
<td>TokLIP</td>
<td>67.6</td>
<td>-</td>
<td>43.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>29.8</td>
<td>-</td>
<td>-</td>
</tr>
<!-- Dashed line separator -->
<tr><td colspan="10" style="border-top: 1px dashed #aaa;"></td></tr>
<!-- Ours -->
<tr>
<td><strong>Ming-UniVision-16B-A3B (Ours)</strong></td>
<td>78.5</td>
<td>63.7</td>
<td>40.3</td>
<td>66.6</td>
<td>47.8</td>
<td>82.8</td>
<td>64.2</td>
<td>724</td>
<td>2023</td>
</tr>
</tbody>
</table>
### Text-to-Image Generation
<table>
<caption>
<strong>Table 2. Evaluation of text-to-image generation ability</strong> on GenEval and DPG-Bench.
<span class="dagger">† denotes performance obtained by rewritten prompts.</span>
</caption>
<thead>
<tr>
<th>Method</th>
<th>Single Obj.&nbsp;<span class="uparrow">↑</span></th>
<th>Two Obj.&nbsp;<span class="uparrow">↑</span></th>
<th>Counting&nbsp;<span class="uparrow">↑</span></th>
<th>Colors&nbsp;<span class="uparrow">↑</span></th>
<th>Position&nbsp;<span class="uparrow">↑</span></th>
<th>Color Attri.&nbsp;<span class="uparrow">↑</span></th>
<th>Overall&nbsp;<span class="uparrow">↑</span></th>
<th>DPG-Bench&nbsp;<span class="uparrow">↑</span></th>
</tr>
</thead>
<tbody>
<!-- Generation Only -->
<tr class="section-row">
<td colspan="9"><em>Generation Only</em></td>
</tr>
<tr>
<td>LlamaGen</td>
<td>0.71</td>
<td>0.34</td>
<td>0.21</td>
<td>0.58</td>
<td>0.07</td>
<td>0.04</td>
<td>0.32</td>
<td>-</td>
</tr>
<tr>
<td>PixArt-α</td>
<td>0.98</td>
<td>0.50</td>
<td>0.44</td>
<td>0.80</td>
<td>0.08</td>
<td>0.07</td>
<td>0.48</td>
<td>-</td>
</tr>
<tr>
<td>SDv2.1</td>
<td>0.98</td>
<td>0.51</td>
<td>0.44</td>
<td>0.85</td>
<td>0.07</td>
<td>0.17</td>
<td>0.50</td>
<td>-</td>
</tr>
<tr>
<td>DALL-E 2</td>
<td>0.94</td>
<td>0.66</td>
<td>0.49</td>
<td>0.77</td>
<td>0.10</td>
<td>0.19</td>
<td>0.52</td>
<td>-</td>
</tr>
<tr>
<td>Emu3-Gen</td>
<td>0.98</td>
<td>0.71</td>
<td>0.34</td>
<td>0.81</td>
<td>0.17</td>
<td>0.21</td>
<td>0.54</td>
<td>80.60</td>
</tr>
<tr>
<td>SDXL</td>
<td>0.98</td>
<td>0.74</td>
<td>0.39</td>
<td>0.85</td>
<td>0.15</td>
<td>0.23</td>
<td>0.55</td>
<td>74.65</td>
</tr>
<tr>
<td>DALL-E 3</td>
<td>0.96</td>
<td>0.87</td>
<td>0.47</td>
<td>0.83</td>
<td>0.43</td>
<td>0.45</td>
<td>0.67</td>
<td>83.50</td>
</tr>
<tr>
<td>SD3-Medium</td>
<td>0.99</td>
<td><strong>0.94</strong></td>
<td>0.72</td>
<td>0.89</td>
<td>0.33</td>
<td>0.60</td>
<td>0.74</td>
<td>84.08</td>
</tr>
<!-- Unified model, Separate representation -->
<tr class="section-row">
<td colspan="9"><em>Unified model, Separate representation</em></td>
</tr>
<tr>
<td>Show-o</td>
<td>0.95</td>
<td>0.52</td>
<td>0.49</td>
<td>0.82</td>
<td>0.11</td>
<td>0.28</td>
<td>0.53</td>
<td>-</td>
</tr>
<tr>
<td>Ming-Lite-Uni</td>
<td>0.99</td>
<td>0.76</td>
<td>0.53</td>
<td>0.87</td>
<td>0.26</td>
<td>0.30</td>
<td>0.62</td>
<td>-</td>
</tr>
<tr>
<td>Janus-Pro-1B</td>
<td>0.98</td>
<td>0.82</td>
<td>0.51</td>
<td>0.89</td>
<td>0.65</td>
<td>0.56</td>
<td>0.73</td>
<td>82.63</td>
</tr>
<tr>
<td>Janus-Pro-7B</td>
<td>0.99</td>
<td>0.89</td>
<td>0.59</td>
<td>0.90</td>
<td>0.79</td>
<td>0.66</td>
<td>0.80</td>
<td>84.19</td>
</tr>
<tr>
<td>Show-o2-7B</td>
<td><strong>1.00</strong></td>
<td>0.87</td>
<td>0.58</td>
<td>0.92</td>
<td>0.52</td>
<td>0.62</td>
<td>0.76</td>
<td><strong>86.14</strong></td>
</tr>
<tr>
<td>MetaQuery-L†</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>0.78</td>
<td>81.10</td>
</tr>
<tr>
<td>Blip3-o 4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>0.81</td>
<td>79.36</td>
</tr>
<tr>
<td>BAGEL</td>
<td>0.99</td>
<td><strong>0.94</strong></td>
<td><strong>0.81</strong></td>
<td>0.88</td>
<td>0.64</td>
<td>0.63</td>
<td>0.82</td>
<td>-</td>
</tr>
<!-- Unified model, Unified representation -->
<tr class="section-row">
<td colspan="9"><em>Unified model, Unified representation</em></td>
</tr>
<tr>
<td>Harmon-1.5B</td>
<td>0.99</td>
<td>0.86</td>
<td>0.66</td>
<td>0.85</td>
<td>0.74</td>
<td>0.48</td>
<td>0.79</td>
<td>-</td>
</tr>
<tr>
<td>TokenFlow-XL</td>
<td>0.95</td>
<td>0.60</td>
<td>0.41</td>
<td>0.81</td>
<td>0.16</td>
<td>0.24</td>
<td>0.55</td>
<td>73.38</td>
</tr>
<!-- Dashed line separator -->
<tr><td colspan="9" style="border-top: 1px dashed #aaa;"></td></tr>
<!-- Ours -->
<tr>
<td><strong>Ming-UniVision-16B-A3B (Ours)</strong></td>
<td><strong>1.00</strong></td>
<td>0.93</td>
<td>0.59</td>
<td><strong>0.93</strong></td>
<td><strong>0.92</strong></td>
<td><strong>0.70</strong></td>
<td><strong>0.85</strong></td>
<td>82.12</td>
</tr>
</tbody>
</table>
## Reference
```bibtex
@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}
```