+---
+base_model:
+- Qwen/Qwen3-1.7B
+---
+# MGM-Omni-TTS-2B
+<div align="left">
+[![Github](https://img.shields.io/badge/Github-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/dvlab-research/MGM-Omni)
+[![Blog](https://img.shields.io/badge/Blog-000000.svg?style=for-the-badge&logo=notion&logoColor=white)](https://mgm-omni.notion.site/MGM-Omni-An-Open-source-Omni-Chatbot-2395728e0b0180149ac9f24683fc9907?source=copy_link)
+[![Models](https://img.shields.io/badge/Models-000000?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1)
+[![Demo](https://img.shields.io/badge/Spaces-000000?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/spaces/wcy1122/MGM-Omni)
+</div>
+## Introduction
+MGM-Omni is an omni-chatbot capable of processing text, image, video, and speech inputs, and generating both text and speech responses.
+MGM-Omni is capable of long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English.
+MGM-Omni-TTS-2B is the SpeechLM component of MGM-Omni, responsible for speech generation. For the MLLM part, please refer to MGM-Omni.
+## Main Properties
+- **Omni-modality support**: MGM-Omni accepts audio, video, image, and text inputs, understands long contexts, and generates both text and speech outputs, making it a versatile multi-modal AI assistant.
+- **Long-form Speech Understanding**: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni can handle hour-long speech inputs while delivering superior overall and detailed understanding.
+- **Long-form Speech Generation**: With abundant training data and Chunk-Based Decoding, MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling.
+- **Streaming Generation**: Thanks to the parallel decoding approach for speech tokens, MGM-Omni enables efficient and smooth streaming audio, making it suitable for live conversations.
+- **Zero-shot Voice Cloning**: Trained on extensive and diverse audio data, MGM-Omni lets you create a customized voice clone by simply recording a short clip (around 10 seconds).
+- **Fully Open-source**: All the code, models, and training data will be released.
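This card does not describe how Chunk-Based Decoding works internally. As a rough illustration of the general idea behind chunked long-form generation (not MGM-Omni's actual implementation), the sketch below splits a long script into sentence-aligned chunks that a speech model could synthesize one after another. The function name `split_into_chunks` and the `max_chars` parameter are hypothetical and chosen only for this example.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper for illustration only; it is not part of MGM-Omni.
    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after English or Chinese sentence-ending punctuation.
    pieces = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in pieces if p.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    story = "Once upon a time there was a fox. It lived in a forest. It loved to sing."
    for index, chunk in enumerate(split_into_chunks(story, max_chars=40)):
        print(index, chunk)
```

In a pipeline of this shape, each chunk would be passed to the speech generator in order and the resulting audio segments concatenated or streamed as they become available.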
+## Evaluation
+### Speech and Audio Understanding
+| Model            | Date    | LS-clean↓ | LS-other↓ | CM-EN↓  | CM-ZH↓  | AISHELL↓ |
+|:-----------------|:--------|:----------|:----------|:--------|:--------|:---------|
+| Mini-Omni2       | 2024-11 | 4.7       | 9.4       | -       | -       | -        |
+| Lyra             | 2024-12 | 2.0       | 4.0       | -       | -       | -        |
+| VITA-1.5         | 2025-01 | 3.4       | 7.5       | -       | -       | 2.2      |
+| Qwen2.5-Omni     | 2025-03 | 1.6       | 3.5       | **7.6** | 5.2     | -        |
+| Ola              | 2025-06 | 1.9       | 4.3       | -       | -       | -        |
+| **MGM-Omni-7B**  | 2025-08 | 1.7       | 3.6       | 8.8     | 4.5     | 1.9      |
+| **MGM-Omni-32B** | 2025-08 | **1.5**   | **3.2**   | 8.0     | **4.0** | **1.8**  |
+This table reports word error rate (WER) and character error rate (CER) on speech understanding benchmarks; lower is better.
+Here LS refers to LibriSpeech and CM refers to Common Voice.
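For readers unfamiliar with these metrics, both are an edit distance between the reference transcript and the model output, divided by the reference length: WER counts words and CER counts characters. The following is a minimal sketch of that definition. It skips the text normalization (casing, punctuation) that real evaluation scripts apply, so it will not reproduce the numbers in the table, and the function names are chosen for this example only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.3f}")
```

CER is the usual choice for Chinese, where text is not segmented into words by spaces.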
+| Model            | Date    | Speech↑ | Sound↑  | Music↑  | Mix↑    | Average↑ |
+|:-----------------|:--------|:--------|:--------|:--------|:--------|:---------|
+| LLaMA-Omni       | 2024-08 | 5.2     | 5.3     | 4.3     | 4.0     | 4.7      |
+| Mini-Omni2       | 2024-11 | 3.6     | 3.5     | 2.6     | 3.1     | 3.2      |
+| IXC2.5-OmniLive  | 2024-12 | 1.6     | 1.8     | 1.7     | 1.6     | 1.7      |
+| VITA-1.5         | 2025-01 | 4.8     | 5.5     | 4.9     | 2.9     | 4.5      |
+| Qwen2.5-Omni     | 2025-03 | 6.8     | 5.7     | 4.8     | 5.4     | 5.7      |
+| Ola              | 2025-06 | **7.3** | 6.4     | 5.9     | 6.0     | 6.4      |
+| **MGM-Omni-7B**  | 2025-08 | **7.3** | **6.5** | **6.3** | 6.1     | **6.5**  |
+| **MGM-Omni-32B** | 2025-08 | 7.1     | **6.5** | 6.2     | **6.2** | **6.5**  |
+This table presents evaluation results on AIR-Bench Chat across speech, sound, music, and mixed-audio inputs; higher is better.
+### Speech Generation
+| Model           | Date    | Model Size | CER↓     | SS(ZH)↑   | WER↓     | SS(EN)↑   |
+|:----------------|:--------|:-----------|:---------|:----------|:---------|:----------|
+| CosyVoice2      | 2024-12 | 0.5B       | 1.45     | 0.748     | 2.57     | 0.652     |
+| Qwen2.5-Omni-3B | 2025-03 | 0.5B       | 1.58     | 0.744     | 2.51     | 0.635     |
+| Qwen2.5-Omni-7B | 2025-03 | 2B         | 1.42     | 0.754     | 2.33     | 0.641     |
+| MOSS-TTSD-v0    | 2025-06 | 2B         | 2.18     | 0.594     | 2.46     | 0.476     |
+| HiggsAudio-v2   | 2025-07 | 6B         | 1.66     | 0.743     | 2.44     | 0.677     |
+| **MGM-Omni**    | 2025-08 | 0.6B       | 1.49     | 0.749     | 2.54     | 0.670     |
+| **MGM-Omni**    | 2025-08 | 2B         | 1.38     | 0.753     | 2.28     | 0.682     |
+| **MGM-Omni**    | 2025-08 | 4B         | **1.34** | **0.756** | **2.22** | **0.684** |
+This table presents speech generation results on seed-tts-eval, with CER and SS(ZH) measured on Chinese and WER and SS(EN) measured on English.
+For Qwen2.5-Omni, model size refers to the size of the talker.
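The SS columns are not defined in this card. Assuming SS denotes speaker similarity, such scores are commonly computed as the cosine similarity between a speaker embedding of the reference prompt and one of the generated speech. The sketch below shows only that final similarity step. The embedding model is not specified here, and the vectors used are hypothetical toy values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy embeddings; real speaker embeddings come from a
    # speaker-verification model and have hundreds of dimensions.
    reference_embedding = [0.12, -0.40, 0.88, 0.05]
    generated_embedding = [0.10, -0.35, 0.90, 0.10]
    score = cosine_similarity(reference_embedding, generated_embedding)
    print(f"speaker similarity: {score:.3f}")
```

A score close to 1 means the generated voice sits near the reference speaker in embedding space, which is why higher is better in the SS columns.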
+## Citation
+If you find this repo useful for your research, we would appreciate it if you could cite our work:
+```
+@misc{wang2025mgmomni,
+  title={MGM-Omni: An Open-source Omni Chatbot},
+  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
+  year={2025},
+  howpublished={\url{https://mgm-omni.notion.site}},
+  note={Notion Blog}
+}
+```