mistralai/Voxtral-Small-24B-2507

@@ -21,21 +21,19 @@ pipeline_tag: audio-text-to-text
 # Voxtral Small 1.0 (24B) - 2507
-Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription and understanding.
 Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
-Both Voxtral models go beyond transcription with capabilities that include:
 ## Key Features
 Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
-- **Long-form context**: with a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding
-- **Built-in Q&A and summarization**: Supports asking questions directly about the audio content or generating structured summaries, without the need to chain separate ASR and language models
-- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, to name a few), helping teams serve global audiences with a single system
-- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents, turning voice interactions into actionable system commands without intermediate parsing steps.
-- **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3
 ## Benchmark Results

 # Voxtral Small 1.0 (24B) - 2507
+Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
 Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
 ## Key Features
 Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
+- **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
+- **Long-form context**: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding
+- **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
+- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
+- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
+- **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
 ## Benchmark Results