Step-Audio-EditX

✨ Demo Page  | 🌟 GitHub  | πŸ“‘ Paper 

Check our open-source repository https://github.com/stepfun-ai/Step-Audio-EditX for more details!

We are open-sourcing Step-Audio-EditX, a powerful 3B-parameter LLM-based audio model specialized in expressive and iterative audio editing. It excels at editing emotion, speaking style, and paralinguistics, and also offers robust zero-shot text-to-speech (TTS).

Features

  • Zero-Shot TTS

    • Excellent zero-shot TTS cloning for Mandarin, English, Sichuanese, and Cantonese.
    • To use a dialect, just add a [Sichuanese] or [Cantonese] tag before your text.
  • Emotion and Speaking Style Editing

    • Remarkably effective iterative control over emotions and speaking styles, with dozens of editing options supported.
      • Emotion Editing: [Angry, Happy, Sad, Excited, Fearful, Surprised, Disgusted, etc.]
      • Speaking Style Editing: [Act_coy, Older, Child, Whisper, Serious, Generous, Exaggerated, etc.]
      • Editing with more emotions and speaking styles is on the way. Get ready! 🚀
  • Paralinguistic Editing

    • Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
    • Supported tags:
      • [Breathing, Laughter, Surprise-oh, Confirmation-en, Uhm, Surprise-ah, Surprise-wa, Sigh, Question-ei, Dissatisfaction-hnn]
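
The dialect and paralinguistic tags above are plain bracketed markers placed in the input text. As a minimal sketch of how such a tag can be applied before synthesis (the helper function and its validation are illustrative, not part of the released API; only the tag names and the `[Tag]text` pattern come from this card):

```python
# Illustrative helper for prepending a bracketed dialect tag to input text.
# The tag list mirrors the README; the function itself is not part of
# the Step-Audio-EditX API.
DIALECT_TAGS = {"Sichuanese", "Cantonese"}

def tag_text(text: str, tag: str) -> str:
    """Prepend a bracketed dialect tag, e.g. '[Cantonese]Hello there'."""
    if tag not in DIALECT_TAGS:
        raise ValueError(f"unknown dialect tag: {tag}")
    return f"[{tag}]{text}"

print(tag_text("Hello there", "Cantonese"))  # [Cantonese]Hello there
```

The resulting string can then be passed as the target text to the TTS entry point.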

For more examples, see the demo page.

Model Usage

πŸ“œ Requirements

The following table lists the requirements for running the Step-Audio-EditX model:

| Model            | Parameters | Sample Frequency | Optimal GPU Memory |
| ---------------- | ---------- | ---------------- | ------------------ |
| Step-Audio-EditX | 3B         | 41.6 Hz          | 32 GB              |
  • An NVIDIA GPU with CUDA support is required.
    • The model is tested on a single L40S GPU.
  • Tested operating system: Linux

πŸ”§ Dependencies and Installation

```bash
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
conda create -n stepaudioedit python=3.10
conda activate stepaudioedit

cd Step-Audio-EditX
pip install -r requirements.txt

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
```

After downloading the models, `where_you_download_dir` should have the following structure:

```
where_you_download_dir
├── Step-Audio-Tokenizer
└── Step-Audio-EditX
```

Run with Docker

You can set up the environment required for running Step-Audio-EditX using the provided Dockerfile.

```bash
# build the Docker image
docker build . -t step-audio-editx

# run the container
docker run --rm --gpus all \
    -v /your/code/path:/app \
    -v /your/model/path:/model \
    -p 7860:7860 \
    step-audio-editx
```

Launch Web Demo

Start a local server for online inference. This assumes you have a GPU with at least 32 GB of memory available and have already downloaded all the models.

```bash
# Step-Audio-EditX demo
python app.py --model-path where_you_download_dir --model-source local
```

Local Inference Demo

For optimal performance, keep audio under 30 seconds per inference.

```bash
# zero-shot cloning
python3 tts_infer.py \
    --model-path where_you_download_dir \
    --output-dir ./output \
    --prompt-text "your prompt text" \
    --prompt-audio your_prompt_audio_path \
    --generated-text "your target text" \
    --edit-type "clone"
```

```bash
# edit
# For paralinguistic editing, --generated-text must be set to the target text.
python3 tts_infer.py \
    --model-path where_you_download_dir \
    --output-dir ./output \
    --prompt-text "your prompt text" \
    --prompt-audio your_prompt_audio_path \
    --generated-text "" \
    --edit-type "emotion" \
    --edit-info "sad" \
    --n-edit-iter 2
```
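
When scripting many edits, the flag list above can be assembled programmatically and handed to `subprocess.run`. A minimal sketch (the helper is illustrative and not part of the repository; only the flag names come from the commands above):

```python
# Illustrative builder for the tts_infer.py argument list shown above;
# the helper itself is not part of the Step-Audio-EditX repository.
def build_edit_cmd(model_path, prompt_text, prompt_audio,
                   edit_type, edit_info=None, generated_text="",
                   n_edit_iter=1, output_dir="./output"):
    cmd = ["python3", "tts_infer.py",
           "--model-path", model_path,
           "--output-dir", output_dir,
           "--prompt-text", prompt_text,
           "--prompt-audio", prompt_audio,
           "--generated-text", generated_text,
           "--edit-type", edit_type,
           "--n-edit-iter", str(n_edit_iter)]
    if edit_info is not None:
        cmd += ["--edit-info", edit_info]
    return cmd

cmd = build_edit_cmd("where_you_download_dir", "your prompt text",
                     "prompt.wav", "emotion", edit_info="sad", n_edit_iter=2)
print(" ".join(cmd))
```

The returned list can be executed with `subprocess.run(cmd)`, keeping arguments safely separated without shell quoting.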

Citation

```bibtex
@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report},
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601},
}
```