---
base_model:
- Qwen/Qwen3-1.7B
---

# MGM-Omni-TTS-2B

<div align="left">

[Code](https://github.com/dvlab-research/MGM-Omni)
[Blog](https://mgm-omni.notion.site/MGM-Omni-An-Open-source-Omni-Chatbot-2395728e0b0180149ac9f24683fc9907?source=copy_link)
[Models](https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1)
[Demo](https://huggingface.co/spaces/wcy1122/MGM-Omni)

</div>

## Introduction

MGM-Omni is an omni-chatbot capable of processing text, image, video, and speech inputs, and generating both text and speech responses.
It supports long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English.
MGM-Omni-TTS-2B is the SpeechLM component of MGM-Omni, used for speech generation. For the MLLM component, please refer to MGM-Omni.
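
The weights can be fetched directly from the Hugging Face Hub. Below is a minimal download sketch; the repo id `wcy1122/MGM-Omni-TTS-2B` is an assumption based on this card's collection, and actual inference requires the code from the GitHub repository linked above.

```python
# Minimal download sketch. Assumption: the repo id below matches this card;
# inference itself needs the code from https://github.com/dvlab-research/MGM-Omni.
from huggingface_hub import snapshot_download

# Fetch the SpeechLM checkpoint into the local Hugging Face cache.
local_dir = snapshot_download(repo_id="wcy1122/MGM-Omni-TTS-2B")
print(f"Model files downloaded to: {local_dir}")
```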

## Main Properties

- **Omni-modality support**: MGM-Omni supports audio, video, image, and text inputs, understands long contexts, and can generate both text and speech outputs, making it a truly versatile multi-modal AI assistant.
- **Long-form Speech Understanding**: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni can handle hour-long speech inputs while delivering superior overall and fine-grained understanding.
- **Long-form Speech Generation**: With abundant training data and Chunk-Based Decoding, MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling (a sketch of the idea follows this list).
- **Streaming Generation**: Thanks to parallel decoding of speech tokens, MGM-Omni generates audio efficiently in a streaming fashion, making it suitable for live conversation.
- **Zero-shot Voice Cloning**: With MGM-Omni's extensive and diverse audio training, you can create a customized voice clone by simply recording a short reference clip (around 10 seconds).
- **Fully Open-source**: All code, models, and training data will be released.
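
The chunk-based decoding mentioned above can be pictured as follows. This is an illustrative sketch, not MGM-Omni's actual implementation: `generate_chunk` is a hypothetical stand-in for the model's real decoding interface, and the text splitting is deliberately naive.

```python
# Illustrative sketch of chunk-based decoding for long-form TTS.
# NOT the actual MGM-Omni code: generate_chunk() and the context-passing
# scheme are hypothetical stand-ins for the model's real interfaces.

def synthesize_long_form(text: str, generate_chunk, max_chars: int = 300) -> list:
    """Split long text into chunks and decode each one conditioned on the
    previous chunk's acoustic context, keeping prosody continuous."""
    # Naive fixed-size split; a real system would split on sentence boundaries.
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    speech_tokens = []
    context = None  # carries acoustic context across chunk boundaries
    for chunk in chunks:
        # Decode one chunk, conditioned on the tail of the audio generated
        # so far, so that voice and pacing stay consistent.
        tokens, context = generate_chunk(chunk, context)
        speech_tokens.extend(tokens)
    return speech_tokens
```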

## Evaluation

### Speech and Audio Understanding

| Model            | Date    | LS-clean↓ | LS-other↓ | CM-EN↓  | CM-ZH↓  | AISHELL↓ |
|:-----------------|:--------|:----------|:----------|:--------|:--------|:---------|
| Mini-Omni2       | 2024-11 | 4.7       | 9.4       | -       | -       | -        |
| Lyra             | 2024-12 | 2.0       | 4.0       | -       | -       | -        |
| VITA-1.5         | 2025-01 | 3.4       | 7.5       | -       | -       | 2.2      |
| Qwen2.5-Omni     | 2025-03 | 1.6       | 3.5       | **7.6** | 5.2     | -        |
| Ola              | 2025-06 | 1.9       | 4.3       | -       | -       | -        |
| **MGM-Omni-7B**  | 2025-08 | 1.7       | 3.6       | 8.8     | 4.5     | 1.9      |
| **MGM-Omni-32B** | 2025-08 | **1.5**   | **3.2**   | 8.0     | **4.0** | **1.8**  |

This table presents WER and CER results on speech understanding.
Here LS refers to LibriSpeech and CM refers to Common Voice.
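
Both metrics compare a model's transcription of the input speech against the reference text. As a minimal illustration of the metric itself, using the `jiwer` package (the example strings are made up, not from any benchmark):

```python
# Word error rate between a reference transcript and an ASR hypothesis.
# Example strings are illustrative only.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions out of nine reference words -> WER ~ 0.222.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```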

| Model            | Date    | Speech↑ | Sound↑  | Music↑  | Mix↑    | Average↑ |
|:-----------------|:--------|:--------|:--------|:--------|:--------|:---------|
| LLaMA-Omni       | 2024-08 | 5.2     | 5.3     | 4.3     | 4.0     | 4.7      |
| Mini-Omni2       | 2024-11 | 3.6     | 3.5     | 2.6     | 3.1     | 3.2      |
| IXC2.5-OmniLive  | 2024-12 | 1.6     | 1.8     | 1.7     | 1.6     | 1.7      |
| VITA-1.5         | 2025-01 | 4.8     | 5.5     | 4.9     | 2.9     | 4.5      |
| Qwen2.5-Omni     | 2025-03 | 6.8     | 5.7     | 4.8     | 5.4     | 5.7      |
| Ola              | 2025-06 | **7.3** | 6.4     | 5.9     | 6.0     | 6.4      |
| **MGM-Omni-7B**  | 2025-08 | **7.3** | **6.5** | **6.3** | 6.1     | **6.5**  |
| **MGM-Omni-32B** | 2025-08 | 7.1     | **6.5** | 6.2     | **6.2** | **6.5**  |

This table presents evaluation results on AIR-Bench Chat (speech, sound, music, and mixed audio).

### Speech Generation

| Model           | Date    | Model Size | CER↓     | SS(ZH)↑   | WER↓     | SS(EN)↑   |
|:----------------|:--------|:-----------|:---------|:----------|:---------|:----------|
| CosyVoice2      | 2024-12 | 0.5B       | 1.45     | 0.748     | 2.57     | 0.652     |
| Qwen2.5-Omni-3B | 2025-03 | 0.5B       | 1.58     | 0.744     | 2.51     | 0.635     |
| Qwen2.5-Omni-7B | 2025-03 | 2B         | 1.42     | 0.754     | 2.33     | 0.641     |
| MOSS-TTSD-v0    | 2025-06 | 2B         | 2.18     | 0.594     | 2.46     | 0.476     |
| HiggsAudio-v2   | 2025-07 | 6B         | 1.66     | 0.743     | 2.44     | 0.677     |
| **MGM-Omni**    | 2025-08 | 0.6B       | 1.49     | 0.749     | 2.54     | 0.670     |
| **MGM-Omni**    | 2025-08 | 2B         | 1.38     | 0.753     | 2.28     | 0.682     |
| **MGM-Omni**    | 2025-08 | 4B         | **1.34** | **0.756** | **2.22** | **0.684** |

This table presents speech generation results on seed-tts-eval.
For Qwen2.5-Omni, model size refers to the size of the talker.
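
SS (speaker similarity) is typically computed as the cosine similarity between speaker embeddings of the voice prompt and of the generated audio. A generic sketch follows; the speaker-embedding extractor is assumed and not shown, since the benchmark's actual verification model is not specified here.

```python
# Generic speaker-similarity sketch: cosine similarity between speaker
# embeddings. A speaker-embedding model is assumed, not shown here.
import numpy as np

def speaker_similarity(prompt_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    num = float(np.dot(prompt_emb, gen_emb))
    den = float(np.linalg.norm(prompt_emb) * np.linalg.norm(gen_emb))
    return num / den

# Usage with made-up 256-dim embeddings in place of real model outputs:
rng = np.random.default_rng(0)
a, b = rng.normal(size=256), rng.normal(size=256)
print(f"SS: {speaker_similarity(a, b):.3f}")
```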

## Citation

If you find this repo useful for your research, we would appreciate it if you could cite our work:

```
@misc{wang2025mgmomni,
  title={MGM-Omni: An Open-source Omni Chatbot},
  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
  year={2025},
  howpublished={\url{https://mgm-omni.notion.site}},
  note={Notion Blog}
}
```