Voxtral Realtime 4B Pure C Implementation
This is a C implementation of the inference pipeline for Mistral AI's Voxtral Realtime 4B model, with zero external dependencies beyond the C standard library. MPS inference is decently fast, while BLAS acceleration is usable but slow (it continuously converts the bf16 weights to fp32).
Audio processing uses a chunked encoder with overlapping windows, bounding memory usage regardless of input length. Audio can also be piped from stdin (`--stdin`) or captured live from the microphone (`--from-mic`, macOS), making it easy to transcode and transcribe any format via ffmpeg. A streaming C API (`vox_stream_t`) lets you feed audio incrementally and receive token strings as they become available.
More testing needed: please note that this project was mostly tested against a few samples, and likely requires more work to reach production quality. However, the hard part (understanding the model and reproducing its inference pipeline) is done, so the rest should be comparatively easy. Testing it against very long transcriptions, which stress the KV cache circular buffer, would be a useful task.
Motivations (and some rant)
Thank you to Mistral for releasing such a great model in an Open Weights fashion. However, the author of this project believes that limiting the inference to a partnership with vLLM, without providing a self-contained reference implementation in Python, limits the model's actual reach and the potential good effects it could have. For this reason, this project was created: it provides both a pure C inference engine and a simple, self-contained Python reference implementation (`python_simple_implementation.py`) that anyone can read and understand without digging through the vLLM codebase.
Quick Start
```sh
# Build (choose your backend)
make mps           # Apple Silicon (fastest)
# or: make blas    # Intel Mac / Linux with OpenBLAS

# Download the model (~8.9GB)
./download_model.sh

# Transcribe audio (tokens stream to stdout as generated)
./voxtral -d voxtral-model -i audio.wav

# Live microphone transcription (macOS, Ctrl+C to stop)
./voxtral -d voxtral-model --from-mic

# Pipe any format via ffmpeg
ffmpeg -i audio.mp3 -f s16le -ar 16000 -ac 1 - 2> /dev/null | \
  ./voxtral -d voxtral-model --stdin

# Real-time streaming with low latency
ffmpeg -i audio.mp3 -f s16le -ar 16000 -ac 1 - 2> /dev/null | \
  ./voxtral -d voxtral-model --stdin -I 0.5
```
That's it. No Python runtime, no CUDA toolkit, no `mistral_common` or vLLM required at inference time.
Python Reference Implementation