Voxtral Mini 1.0 (3B) - 2507
Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
Learn more about Voxtral in our blog post here.
Key Features
Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
Long-form context: With a 32k token context length, Voxtral handles audio of up to 30 minutes for transcription, or 40 minutes for understanding.
Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models.
Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B.
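The long-form limits above can be read as a token budget: audio occupies context tokens in proportion to its duration, and whatever remains is available for the text prompt and the generated output. The sketch below works through that arithmetic. The rate of 12.5 audio tokens per second and the reading of "32k" as 32,768 tokens are assumptions made for illustration; neither figure is stated in this card.

```python
import math

# ASSUMPTIONS (not stated in this card): 12.5 audio tokens per second of
# audio, and "32k" meaning 32 * 1024 = 32,768 context tokens.
AUDIO_TOKENS_PER_SECOND = 12.5
CONTEXT_TOKENS = 32 * 1024


def audio_tokens(duration_seconds: float) -> int:
    """Estimate how many context tokens an audio clip occupies."""
    return math.ceil(duration_seconds * AUDIO_TOKENS_PER_SECOND)


def remaining_tokens(duration_seconds: float) -> int:
    """Tokens left for the text prompt and the generated output."""
    return CONTEXT_TOKENS - audio_tokens(duration_seconds)


if __name__ == "__main__":
    for minutes in (30, 40):
        seconds = minutes * 60
        print(f"{minutes} min: {audio_tokens(seconds)} audio tokens, "
              f"{remaining_tokens(seconds)} tokens left")
```

Under these assumptions, longer clips leave progressively less room for text, so the usable audio length depends on how much output a task needs.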
Benchmark Results
Audio
Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
Text
Usage
The model can be used with the following frameworks:
Notes:
Use temperature=0.2 and top_p=0.95 for chat completion (e.g. audio understanding) and temperature=0.0 for transcription
Multiple audio inputs per message and multiple user turns with audio are supported
System prompts are not yet supported
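One way to keep these sampling settings consistent across client code is a small lookup helper. The following is a minimal sketch; the function name and the task labels "chat" and "transcription" are hypothetical and exist only for this example.

```python
def sampling_params(task: str) -> dict:
    """Return the sampling settings recommended in the notes above.

    `task` is a hypothetical label used only in this sketch:
    "chat" for chat completion (e.g. audio understanding),
    "transcription" for pure speech transcription.
    """
    if task == "chat":
        return {"temperature": 0.2, "top_p": 0.95}
    if task == "transcription":
        return {"temperature": 0.0}
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    print(sampling_params("chat"))
    print(sampling_params("transcription"))
```

With an OpenAI-compatible client, the result can be unpacked into a request, for example `client.chat.completions.create(model=model, messages=messages, **sampling_params("chat"))`.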
vLLM (recommended)
We recommend using this model with vLLM.
Installation
Make sure to install vLLM from "main". We recommend using uv:
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Doing so should automatically install mistral_common >= 1.8.0.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
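The one-liner prints the version but leaves the comparison to the reader. A scripted check might look like the sketch below, which uses only the standard library. The helper names are made up for this example, and the parser deliberately ignores pre-release suffixes such as `rc1`, which is a simplification.

```python
import re
from importlib import metadata

MINIMUM = (1, 8, 0)  # minimum mistral_common version named above


def parse_version(version: str) -> tuple:
    """Extract the leading numeric components, e.g. "1.8.0rc1" -> (1, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version: str, minimum: tuple = MINIMUM) -> bool:
    """True if `version` is at least `minimum` (suffixes are ignored)."""
    parts = parse_version(version)
    # Pad short versions such as "1.8" so tuple comparison behaves as expected.
    parts += (0,) * max(0, len(minimum) - len(parts))
    return parts >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("mistral_common")
    except metadata.PackageNotFoundError:
        print("mistral_common is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"mistral_common {installed}: {status}")
```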
Offline
You can test that your vLLM setup works as expected by cloning the vLLM repo:
git clone https://github.com/vllm-project/vllm && cd vllm
and then running:
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
Serve
We recommend that you use Voxtral-Mini-3B-2507 in a server/client setting.
Spin up a server:
vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
Note: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.
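Loading the weights can take a while, so it can help to confirm that the server is reachable before sending audio. The sketch below queries the OpenAI-compatible model listing endpoint using only the standard library. The default host `localhost` is an assumption (the port 8000 matches the client examples in this card), and the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def models_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    # ASSUMPTION: the server runs on localhost unless told otherwise.
    return f"http://{host}:{port}/v1/models"


def served_models(host: str = "localhost", port: int = 8000,
                  timeout: float = 5.0) -> list:
    """Return the ids of the models the server reports, or [] if unreachable."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return []
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    ids = served_models()
    print(ids if ids else "server not reachable yet")
```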
To query the server you can use a simple Python snippet. See the following examples.
Audio Instruct
Leverage the audio capabilities of Voxtral-Mini-3B-2507 to chat.
Make sure that your client has mistral-common with audio installed:
pip install --upgrade mistral_common\[audio\]
Python snippet
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Placeholder key: the serve command above does not set an API key.
openai_api_key = "EMPTY"
# The host is missing from this URL; insert your server's host before the port.
openai_api_base = "http://:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first (and only) model exposed by the server.
models = client.models.list()
model = models.data[0].id

# Download two sample audio files from the Hugging Face Hub.
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")


def file_to_chunk(file: str) -> AudioChunk:
    # Load the audio file and wrap it in a chunk that can be placed in a message.
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)


# First turn: two audio clips followed by a text question.
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n")

# Second turn: keep the history and ask a text-only follow-up question.
messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai(),
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
Transcription
Voxtral-Mini-3B-2507 has powerful transcription capabilities!
Make sure that your client has mistral-common with audio installed:
pip install --upgrade mistral_common\[audio\]