
Step-Audio-R1

Demo Page | 🎮 Playground | 🌟 GitHub | 📑 Paper

Step-Audio-R1 is the first audio language model to successfully unlock Chain-of-Thought (CoT) reasoning. It decisively solves the "inverted scaling" problem that plagues existing models, where performance degrades as reasoning grows longer. Step-Audio-R1 is the first model to demonstrate that for audio, as for text and vision, allocating more compute at test time predictably improves performance.

We found the root cause of this anomaly: models were engaging in textual surrogate reasoning (analyzing transcripts, not audio) due to a modality mismatch. To solve this, we introduce Modality-Grounded Reasoning Distillation (MGRD), an iterative training framework that shifts the model's reasoning from textual abstractions to acoustic properties.

This new approach allows us to create Step-Audio-R1, which:

  • Is the first audio reasoning model that successfully benefits from test-time compute scaling.
  • Surpasses Gemini 2.5 Pro and is comparable to Gemini 3 across major audio reasoning tasks.
  • Transforms extended deliberation from a liability into a powerful asset for audio intelligence.

Features

  • Chain-of-Thought (CoT) Reasoning

    • First audio language model to successfully unlock Chain-of-Thought reasoning capabilities.
    • Generates audio-relevant reasoning chains that genuinely ground themselves in acoustic features.
  • Modality-Grounded Reasoning Distillation (MGRD)

    • Innovative iterative training framework that shifts reasoning from textual abstractions to acoustic properties.
    • Solves the modality mismatch problem that caused textual surrogate reasoning in previous models.
  • Superior Performance

    • Surpasses Gemini 2.5 Pro across comprehensive audio understanding and reasoning benchmarks.
    • Comparable to Gemini 3 across major audio reasoning tasks.
    • Surpasses Qwen3 in textual reasoning.
    • Covers speech, environmental sounds, and music domains.

For more examples, see the demo page.

Model Usage

📜 Requirements

  • GPU: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
  • Operating System: Linux.
  • Python: >= 3.10.0.

⬇️ Download Model

First, download the Step-Audio-R1 model weights.

Method A · Git LFS

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1

Method B · Hugging Face CLI

hf download stepfun-ai/Step-Audio-R1 --local-dir ./Step-Audio-R1
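
If you prefer a programmatic download, the huggingface_hub Python package gives the same result. This is a minimal sketch (assumes huggingface_hub is installed; the local_dir path mirrors the CLI example above):

# Minimal sketch: download the weights with huggingface_hub instead of the CLI.
# Assumes `pip install -U huggingface_hub` and enough disk space for the 33B BF16 checkpoint.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stepfun-ai/Step-Audio-R1",
    local_dir="./Step-Audio-R1",
)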

🚀 Deployment and Execution

We provide two ways to serve the model: via Docker (recommended) or by compiling the customized vLLM backend from source.

🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

  1. Pull the image:

    docker pull stepfun2025/vllm:step-audio-2-v20250909

  2. Start the service, assuming the model weights are in the Step-Audio-R1 folder of the current directory:

    docker run --rm -ti --gpus all \
        -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
        -p 9999:9999 \
        stepfun2025/vllm:step-audio-2-v20250909 \
        -- vllm serve /Step-Audio-R1 \
        --served-model-name Step-Audio-R1 \
        --port 9999 \
        --max-model-len 16384 \
        --max-num-seqs 32 \
        --tensor-parallel-size 4 \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
        --enable-log-requests \
        --interleave-mm-strings \
        --trust-remote-code
    

After the service starts, it will listen on localhost:9999.
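
To verify that the endpoint is reachable, you can list the served models through the OpenAI-compatible API. This is a minimal sketch (assumes the openai Python package; vLLM exposes the standard /v1 routes and does not require a real API key by default):

# Minimal sketch: confirm the vLLM OpenAI-compatible server is up.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")  # placeholder key
for model in client.models.list():
    print(model.id)  # expected: Step-Audio-R1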

🛠️ Method 2 · Run from Source (Compile vLLM)

Step-Audio-R1 requires a customized vLLM backend.

  1. Download Source Code:

    git clone https://github.com/stepfun-ai/vllm.git
    cd vllm
    
  2. Prepare Environment:

    python3 -m venv .venv
    source .venv/bin/activate
    
  3. Install and Compile: vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.

    # Use pre-compiled C++ extensions (Recommended)
    VLLM_USE_PRECOMPILED=1 pip install -e .
    
  4. Switch Branch: After compilation, switch to the branch that supports Step-Audio.

    git checkout step-audio-2-mini
    
  5. Start the Service:

    # Ensure you are in the vllm directory and the virtual environment is activated
    source .venv/bin/activate
    
    python3 -m vllm.entrypoints.openai.api_server \
        --model ../Step-Audio-R1 \
        --served-model-name Step-Audio-R1 \
        --port 9999 \
        --host 0.0.0.0 \
        --max-model-len 65536 \
        --max-num-seqs 128 \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.85 \
        --trust-remote-code \
        --enable-log-requests \
        --interleave-mm-strings \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
    

After the service starts, it will listen on localhost:9999.

🧪 Client Examples

Get the example code and run it:

# Clone the repository containing example scripts
git clone https://github.com/stepfun-ai/Step-Audio-R1.git r1-scripts

# Run the example
cd r1-scripts
python examples-vllm_r1.py
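
As a quick sanity check before wiring in audio, you can also send a plain text request through the OpenAI-compatible chat API. This is a minimal sketch, not the official example; for audio inputs, follow the request format used in the repository's example script:

# Minimal sketch: text-only chat request against the deployed Step-Audio-R1 server.
# Assumes the server from the deployment step is listening on localhost:9999.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Step-Audio-R1",
    messages=[{"role": "user", "content": "Briefly explain what makes a voice sound 'bright'."}],
    max_tokens=1024,
    temperature=0.7,
)

# The chat template opens a <think> block for the assistant, so the reply
# typically contains the reasoning trace followed by the final answer.
print(response.choices[0].message.content)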

Citation

@article{tian2025step,
  title={Step-Audio-R1 Technical Report},
  author={Tian, Fei and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Li, Yuxin and Liu, Daijiao and Deng, Yayue and Wu, Donghang and Chen, Jun and Zhao, Liang and others},
  journal={arXiv preprint arXiv:2511.15848},
  year={2025}
}