Mistral AI Unveils Voxtral

Mistral AI has introduced Voxtral, a new open-source speech understanding model that is poised to compete with established proprietary APIs. Voxtral is available in two versions: a 24 billion parameter model and a 3 billion parameter model. This article will focus on the capabilities and a local demonstration of the 3 billion parameter version.
Voxtral offers a number of notable features, including the ability to process long-form audio of up to 40 minutes, built-in question answering and summarization, multilingual support with automatic language detection, and function calling directly from voice input. It builds on Mistral's earlier language models and retains the text understanding capabilities of Mistral Small 3.1. Benchmarks show that Voxtral is competitive in speech transcription, audio understanding, and translation, particularly in multilingual scenarios.
A demonstration of the 3 billion parameter model showcased its capabilities. The model was installed locally on an Ubuntu system with an Nvidia RTX A6000 GPU and served with the vLLM inference engine.
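As a rough sketch of that setup, the snippet below assumes the weights are published on Hugging Face as mistralai/Voxtral-Mini-3B-2507 and that vLLM's OpenAI-compatible server was started with a vllm serve command like the one shown in the comment (the exact flags depend on the vLLM version). The Python code simply points the standard OpenAI client at the local endpoint and lists the served model to confirm it is running.

```python
# Sketch: connect to a locally running vLLM server and verify the model is loaded.
# Assumes the server was started on the default port with something like:
#   vllm serve mistralai/Voxtral-Mini-3B-2507 \
#     --tokenizer_mode mistral --config_format mistral --load_format mistral
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; no real API key is required locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server is currently serving.
for model in client.models.list().data:
    print(model.id)
```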
The demonstration highlighted the model's ability to transcribe and analyze an English audio file, returning a full transcription along with an analysis, key themes, and a summary. Its multilingual capabilities were also tested with audio files in Spanish, French, Portuguese, Hindi, and German; the model reliably translated the audio into English, but it did not always produce output in the original language. A function calling demonstration converted a spoken request for a new UUID into a structured API call, which returned a unique identifier.
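A transcription-and-analysis request like the one in the demo might look roughly like the following, again assuming the mistralai/Voxtral-Mini-3B-2507 model name, a local vLLM server on port 8000, and a hypothetical speech_en.mp3 file. The audio is passed as a base64-encoded input_audio content part; the exact message schema can vary between vLLM versions, and Mistral's mistral-common package also provides helpers for building these messages.

```python
# Sketch: ask Voxtral to transcribe and analyze a local English audio file.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the audio file (hypothetical path) as base64 for the request body.
with open("speech_en.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
            {"type": "text",
             "text": "Transcribe this audio, then list its key themes and give a short summary."},
        ],
    }],
    temperature=0.2,
)
print(response.choices[0].message.content)
```

The function calling demo could be reproduced along similar lines: declare a tool in the request, let the model map the spoken command to a tool call, and execute it locally. The generate_uuid tool and audio file below are hypothetical, and depending on the vLLM version, tool-call parsing may need to be enabled when the server is started (for example with --enable-auto-tool-choice and a tool-call parser).

```python
# Sketch: function calling from a voice command ("generate a new UUID").
import base64
import uuid
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical recording of the spoken command.
with open("generate_uuid_command.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# Hypothetical tool the model can choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "generate_uuid",
        "description": "Generate a new random UUID and return it to the user.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{
        "role": "user",
        "content": [{"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "mp3"}}],
    }],
    tools=tools,
    temperature=0.2,
)

# If the model chose to call the tool, run it locally and print the result.
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "generate_uuid":
        print("New UUID:", uuid.uuid4())
```

In a full application, the tool's result would be appended to the conversation and sent back to the model so it can phrase the final answer for the user.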
While the 3 billion parameter model has a significant VRAM footprint, its output quality and clear improvements over previous models make it a powerful new tool for developers.