---
license: apache-2.0
pipeline_tag: any-to-any
---
# Ming-UniAudio
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
## Introduction
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively unifies semantic and acoustic features within an end-to-end model. We developed a speech language model that strikes a balance between generation and understanding capabilities based on the unified continuous audio tokenizer. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio)
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)
     
## 📌 Updates
* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks.
## Key Features
Compared with other audio-assisted LLMs, Ming-UniAudio offers the following key features:
- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (a conceptual sketch of this loop follows the list).
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks.
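To make the closed loop concrete, here is a conceptual sketch of how a unified continuous tokenizer serves both directions. All names below (`tokenizer.encode`, `llm.generate_text`, `llm.generate_speech_latents`, `tokenizer.decode`) are hypothetical placeholders rather than the actual Ming-UniAudio API; see the Example Usage section for the real interface.

```python
# Conceptual sketch only: every name below is a hypothetical placeholder.
# It illustrates how one continuous tokenizer closes the loop between
# understanding (speech -> text) and generation (text -> speech).
import torch


def understand(tokenizer, llm, waveform: torch.Tensor) -> str:
    """Speech understanding: waveform -> continuous latents -> LLM -> text."""
    latents = tokenizer.encode(waveform)          # semantic + acoustic features
    return llm.generate_text(latents)             # single LLM backbone


def generate(tokenizer, llm, text: str) -> torch.Tensor:
    """Speech generation: text -> LLM + diffusion head -> latents -> waveform."""
    latents = llm.generate_speech_latents(text)   # diffusion head predicts continuous latents
    return tokenizer.decode(latents)              # the same tokenizer reconstructs audio
```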
##  Evaluation
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
### Speech Understanding
ASR performance comparison on various audio benchmark datasets. The best results are in bold.

| Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|:------|:------------:|:--------:|:-----:|:------:|:--------:|:-------:|:--------:|
| Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
### Speech Generation
Performance comparison on various audio benchmark datasets. The best results are in bold.

| Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|:------|:--------------:|:-----------:|:--------------:|:-----------:|
| Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| MiMo-Audio | 1.96 | - | 5.37 | - |
| Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
## Model & Benchmark Downloads
You can download our latest models and benchmark from both Hugging Face and ModelScope.
| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|:---------|:----------|:------------------:|:-------------------:|:------------:|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio) <br> [🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B) <br> [🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit) <br> [🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br> [🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br> [Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
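If you prefer a scripted download, the same weights can be fetched with the Hugging Face Hub or ModelScope Python SDKs. The snippet below is a minimal sketch; adjust `local_dir` to wherever you want the weights to live.

```python
# Minimal download sketch using the Hugging Face Hub SDK.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)

# Alternative for mainland China, via the ModelScope SDK:
# from modelscope import snapshot_download
# model_dir = snapshot_download("inclusionAI/Ming-UniAudio-16B-A3B")
```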
## Use Cases
Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/).
## Environment Preparation
### Installation with pip
```shell
pip install -r requirements.txt
```
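After the dependencies are installed, a quick sanity check (a minimal sketch, assuming a CUDA-capable GPU is present) confirms that PyTorch can see your device before you load the 16B model:

```python
# Environment sanity check: verifies PyTorch and CUDA before loading the model.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```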
### Installation with docker
You can also initialize the environment by building the docker image. First clone this repository:
```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:
```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```
Finally, start the container with the current repo directory mounted:
```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```
You can then run the model through its Python interface. Either download the Hugging Face model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.
## Example Usage
We provide a step-by-step running example:
Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory
Download our model as described in `Model & Benchmark Downloads`, then link it into the source directory:
```shell
mkdir inclusionAI 
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```
Step 3 - Enter the code directory and refer to the following code to run the Ming-UniAudio model.
```shell
jupyter notebook cookbooks/demo.ipynb
```
We also provide a simple example of how to use this repo. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
```python
import warnings
import random

import numpy as np
import torch
from loguru import logger
from transformers import AutoProcessor

from modeling_bailingmm import BailingMMNativeForConditionalGeneration


def seed_everything(seed=1895):
    """Fix all random seeds so that generation results are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")

class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size
        
    def speech_understanding(self, messages):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
        return output_text
    def speech_generation(
        self, 
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav'
    ):
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device
        )
        
        return waveform
    def speech_edit(
        self, 
        messages,
        output_wav_path='out.wav'
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)
        ans = torch.tensor([self.tokenizer.encode('')]).to(inputs['input_ids'].device)
        inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
        attention_mask = inputs['attention_mask']
        inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")
        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path
        )
        return edited_speech, edited_text
if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")
    
    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")
    # TTS
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
    )
```
Note: We tested the examples on NVIDIA H800-80GB / H20-96GB hardware with CUDA 12.4.
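The `__main__` block above covers ASR and TTS; the `speech_edit` method can be driven in the same way. The call below is a hypothetical example: the instruction wording and audio path are placeholders, and it assumes the edit instruction is passed as a text item alongside the audio, mirroring the ASR message format.

```python
# Hypothetical free-form editing call; instruction text and audio path are placeholders.
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please remove the filler words from this speech."},
            {"type": "audio", "audio": "data/wavs/example_to_edit.wav"},
        ],
    },
]
edited_speech, edited_text = model.speech_edit(messages=messages, output_wav_path="edited.wav")
logger.info(f"Edited transcript: {edited_text}")
```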
## Citation
If you find our work helpful, please feel free to cite it.