wcy1122 committed on
Commit
cd56e1e
·
verified ·
1 Parent(s): 4d3a6b3

Create README.md

Files changed (1)
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen3-1.7B
4
+ ---
5
+
6
+
7
+ # MGM-Omni-TTS-2B
8
+
9
+ <div align="left">
10
+
11
+ [![Github](https://img.shields.io/badge/Github-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/dvlab-research/MGM-Omni)
12
+ [![Blog](https://img.shields.io/badge/Blog-000000.svg?style=for-the-badge&logo=notion&logoColor=white)](https://mgm-omni.notion.site/MGM-Omni-An-Open-source-Omni-Chatbot-2395728e0b0180149ac9f24683fc9907?source=copy_link)
13
+ [![Models](https://img.shields.io/badge/Models-000000?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1)
14
+ [![Demo](https://img.shields.io/badge/Spaces-000000?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/spaces/wcy1122/MGM-Omni)
15
+
16
+ </div>
17
+
18
+
19
+ ## Introduction
20
+
21
+ MGM-Omni is an omni-chatbot capable of processing text, image, video, and speech inputs, and generating both text and speech responses.
22
+ It also supports long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English.
23
+ MGM-Omni-TTS-2B is the SpeechLM component of MGM-Omni for speech generation. For the MLLM part, please refer to MGM-Omni.
24
+
25
+
26
+ ## Main Properties
27
+
28
+ - **Omni-modality Support**: MGM-Omni supports audio, video, image, and text inputs, understands long contexts, and can generate both text and speech outputs, making it a versatile multi-modal AI assistant.
29
+ - **Long-form Speech Understanding**: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni can handle hour-long speech inputs while delivering superior understanding of both the overall content and fine details.
30
+ - **Long-form Speech Generation**: With large-scale training data and Chunk-Based Decoding, MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling.
31
+ - **Streaming Generation**: Thanks to the parallel decoding approach for speech tokens, MGM-Omni enables efficient and smooth streaming audio, making it suitable for live conversations.
32
+ - **Zero-shot Voice Cloning**: Thanks to MGM-Omni's extensive and diverse audio training data, you can create a customized voice clone from a short recording (around 10 seconds).
33
+ - **Fully Open-source**: All the code, models, and training data will be released.
34
+
35
+
36
+ ## Evaluation
37
+
38
+ ### Speech and Audio Understanding
39
+
40
+ | Model | Date | LS-clean↓ | LS-other↓ | CM-EN↓ | CM-ZH↓ | AISHELL↓ |
41
+ |:-----------------|:--------|:----------|:----------|:--------|:--------|:---------|
42
+ | Mini-Omni2 | 2024-11 | 4.7 | 9.4 | - | - | - |
43
+ | Lyra | 2024-12 | 2.0 | 4.0 | - | - | - |
44
+ | VITA-1.5 | 2025-01 | 3.4 | 7.5 | - | - | 2.2 |
45
+ | Qwen2.5-Omni | 2025-03 | 1.6 | 3.5 | **7.6** | 5.2 | - |
46
+ | Ola | 2025-06 | 1.9 | 4.3 | - | - | - |
47
+ | **MGM-Omni-7B** | 2025-08 | 1.7 | 3.6 | 8.8 | 4.5 | 1.9 |
48
+ | **MGM-Omni-32B** | 2025-08 | **1.5** | **3.2** | 8.0 | **4.0** | **1.8** |
49
+
50
+ This table presents WER and CER results on speech understanding.
51
+ Here LS refers to LibriSpeech and CM refers to Common Voice.
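
For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how WER (word error rate) and CER (character error rate) are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are illustrative and not part of the MGM-Omni codebase. The sketch omits text normalization (casing, punctuation), which real evaluation scripts usually apply first, and it assumes the table reports these ratios as percentages.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (typical for Chinese text)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(f"WER: {100 * wer('the cat sat on the mat', 'the cat sat on mat'):.1f}%")
    # One dropped character out of four reference characters.
    print(f"CER: {100 * cer('你好世界', '你好世'):.1f}%")
```

Lower values are better for both metrics, which matches the ↓ markers in the table header.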
52
+
53
+ | Model | Date | Speech↑ | Sound↑ | Music↑ | Mix↑ | Average↑ |
54
+ |:-----------------|:--------|:--------|:--------|:--------|:--------|:---------|
55
+ | LLaMA-Omni | 2024-08 | 5.2 | 5.3 | 4.3 | 4.0 | 4.7 |
56
+ | Mini-Omni2 | 2024-11 | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 |
57
+ | IXC2.5-OmniLive | 2024-12 | 1.6 | 1.8 | 1.7 | 1.6 | 1.7 |
58
+ | VITA-1.5 | 2025-01 | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 |
59
+ | Qwen2.5-Omni | 2025-03 | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 |
60
+ | Ola | 2025-06 | **7.3** | 6.4 | 5.9 | 6.0 | 6.4 |
61
+ | **MGM-Omni-7B** | 2025-08 | **7.3** | **6.5** | **6.3** | 6.1 | **6.5** |
62
+ | **MGM-Omni-32B** | 2025-08 | 7.1 | **6.5** | 6.2 | **6.2** | **6.5** |
63
+
64
+ This table presents evaluation results on AIR-Bench Chat (speech, sound, music, etc.).
65
+
66
+ ### Speech Generation
67
+
68
+ | Model | Date | Model Size | CER↓ | SS(ZH)↑ | WER↓ | SS(EN)↑ |
69
+ |:----------------|:--------|:-----------|:---------|:----------|:---------|:----------|
70
+ | CosyVoice2 | 2024-12 | 0.5B | 1.45 | 0.748 | 2.57 | 0.652 |
71
+ | Qwen2.5-Omni-3B | 2025-03 | 0.5B | 1.58 | 0.744 | 2.51 | 0.635 |
72
+ | Qwen2.5-Omni-7B | 2025-03 | 2B | 1.42 | 0.754 | 2.33 | 0.641 |
73
+ | MOSS-TTSD-v0 | 2025-06 | 2B | 2.18 | 0.594 | 2.46 | 0.476 |
74
+ | HiggsAudio-v2 | 2025-07 | 6B | 1.66 | 0.743 | 2.44 | 0.677 |
75
+ | **MGM-Omni** | 2025-08 | 0.6B | 1.49 | 0.749 | 2.54 | 0.670 |
76
+ | **MGM-Omni** | 2025-08 | 2B | 1.38 | 0.753 | 2.28 | 0.682 |
77
+ | **MGM-Omni** | 2025-08 | 4B | **1.34** | **0.756** | **2.22** | **0.684** |
78
+
79
+ This table presents evaluation results on speech generation on seed-tts-eval.
80
+ For Qwen2.5-Omni, model size refers to the size of the talker.
81
+
82
+
83
+ ## Citation
84
+ If you find this repo useful for your research, we would appreciate it if you could cite our work:
85
+ ```
86
+ @misc{wang2025mgmomni,
87
+ title={MGM-Omni: An Open-source Omni Chatbot},
88
+ author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
89
+ year={2025},
90
+ howpublished={\url{https://mgm-omni.notion.site}},
91
+ note={Notion Blog}
92
+ }
93
+ ```