mistralai
/

Voxtral-Small-24B-2507

@@ -56,9 +56,11 @@ Voxtral builds upon Mistral Small 3 with powerful audio understanding capabiliti
 The model can be used with the following frameworks;
 - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
-**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.
-**Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
 ### vLLM (recommended)
@@ -66,20 +68,34 @@ We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
 #### Installation
-Make sure to install [`vLLM >= 0.#.#`](https://github.com/vllm-project/vllm/releases/tag/v0.#.#):
 ```
-pip install vllm --upgrade
 ```
-Doing so should automatically install [`mistral_common >= 1.#.#`](https://github.com/mistralai/mistral-common/releases/tag/v1.#.#).
 To check:
 ```
 python -c "import mistral_common; print(mistral_common.__version__)"
 ```
-You can also make use of a ready-to-go [docker image](https://github.com/vllm-project/vllm/blob/main/Dockerfile) or on the [docker hub](https://hub.docker.com/layers/vllm/vllm-openai/latest/images/sha256-de9032a92ffea7b5c007dad80b38fd44aac11eddc31c435f8e52f3b7404bbf39).
 #### Serve
@@ -88,7 +104,7 @@ We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
 1. Spin up a server:
 ```
-vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
 ```
 **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
@@ -105,56 +121,6 @@ Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
   <summary>Python snippet</summary>
 ```py
-TODO
-```
-</details>
-#### Transcription
-Voxtral-Small-24B-2507 has powerfull transcription capabilities!
-<details>
-  <summary>Python snippet</summary>
-```python
-TODO
-```
-</details>
-#### Function calling
-Voxtral-Small-24B-2507 is excellent at function / tool calling tasks via vLLM. *E.g.:*
-<details>
-  <summary>Python snippet</summary>
-```py
-```
-</details>
-# ORIGINAL
-```
-VLLM_USE_PRECOMPILED=1 pip install --editable .\[audio\]
-```
-of: https://github.com/vllm-project/vllm/pull/20970#pullrequestreview-3019578541
-# Examples
-## Client/Server
-### Server
-```sh
-vllm serve mistralai/voxtral-small --tokenizer_mode mistral --config_format mistral --load_format mistral --max_model_len 32768
-```
-### Client - Chat
-```py
-#!/usr/bin/env python3
 from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
 from mistral_common.audio import Audio
 from huggingface_hub import hf_hub_download
@@ -163,7 +129,7 @@ from openai import OpenAI
 # Modify OpenAI's API key and API base to use vLLM's API server.
 openai_api_key = "EMPTY"
-openai_api_base = "http://slurm-h100-reserved-rno-199-087:8000/v1"
 client = OpenAI(
     api_key=openai_api_key,
@@ -220,10 +186,16 @@ content = response.choices[0].message.content
 print(30 * "=" + "BOT 2" + 30 * "=")
 print(content)
 ```
-### Client - Transcribe
-```py
 from mistral_common.protocol.transcription.request import TranscriptionRequest
 from mistral_common.protocol.instruct.messages import RawAudio
 from mistral_common.audio import Audio
@@ -233,7 +205,7 @@ from openai import OpenAI
 # Modify OpenAI's API key and API base to use vLLM's API server.
 openai_api_key = "EMPTY"
-openai_api_base = "http://slurm-h100-reserved-rno-199-087:8000/v1"
 client = OpenAI(
     api_key=openai_api_key,
@@ -252,5 +224,4 @@ req = TranscriptionRequest(model=model, audio=audio, language="en").to_openai(ex
 response = client.audio.transcriptions.create(**req)
 print(response)
 ```

 The model can be used with the following frameworks;
 - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
+**Recommended settings**:
+- `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
+- Multiple audios per message and multiple user turns with audio are supported
+- System prompts are not yet supported
 ### vLLM (recommended)
 #### Installation
+Make sure to install vllm from "main":
 ```
+pip install -U vllm[audio] \
+    --pre \
+    --extra-index-url https://wheels.vllm.ai/nightly
 ```
+Doing so should automatically install [`mistral_common >= 1.8.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.0).
 To check:
 ```
 python -c "import mistral_common; print(mistral_common.__version__)"
 ```
+#### Offline
+You can test that your vLLM setup works as expected by cloning the vLLM repo:
+```sh
+git clone https://github.com/vllm-project/vllm && cd vllm
+```
+and then running:
+```sh
+python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
+```
 #### Serve
 1. Spin up a server:
 ```
+vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2
 ```
 **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
   <summary>Python snippet</summary>
 ```py
 from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
 from mistral_common.audio import Audio
 from huggingface_hub import hf_hub_download
 # Modify OpenAI's API key and API base to use vLLM's API server.
 openai_api_key = "EMPTY"
+openai_api_base = "http://<your-server-host>:8000/v1"
 client = OpenAI(
     api_key=openai_api_key,
 print(30 * "=" + "BOT 2" + 30 * "=")
 print(content)
 ```
+</details>
+#### Transcription
+Voxtral-Small-24B-2507 has powerful transcription capabilities!
+<details>
+  <summary>Python snippet</summary>
+```python
 from mistral_common.protocol.transcription.request import TranscriptionRequest
 from mistral_common.protocol.instruct.messages import RawAudio
 from mistral_common.audio import Audio
 # Modify OpenAI's API key and API base to use vLLM's API server.
 openai_api_key = "EMPTY"
+openai_api_base = "http://<your-server-host>:8000/v1"
 client = OpenAI(
     api_key=openai_api_key,
 response = client.audio.transcriptions.create(**req)
 print(response)
 ```
+</details>