Artificial Intelligence | Jun 22, 2026 | 7 min read

Voxtral: Mistral releases open-weight voice AI — speech generation and transcription for your own server

With Voxtral, the European AI lab Mistral has released two open-weight speech models: text-to-speech with voice cloning and real-time transcription — both deployable on your own infrastructure, GDPR-friendly, and significantly cheaper than proprietary alternatives such as ElevenLabs.


Many mid-sized companies are exploring voice AI. It makes sense for automated meeting minutes, spoken output in customer-facing applications or accessible documentation. The barrier has consistently been the same: dependence on US cloud APIs, with all the questions about data protection and data sovereignty that entails. In early 2026, the French AI lab Mistral changed that with its Voxtral model family — two open-weight speech models that can be run on your own hardware, free of charge.

Two models, two directions: what Voxtral can do

The Voxtral family covers both directions of voice processing. Voxtral Transcribe 2 (speech-to-text) was released on 4 February 2026; Voxtral TTS (text-to-speech) followed on 26 March 2026.

  • Voxtral TTS converts text into natural speech in nine languages, including German, French, Spanish, Italian and Dutch. The model ships with 20 built-in preset voices. Via zero-shot voice cloning, a custom voice can be derived from as little as three seconds of reference audio — without training a separate model. API pricing: $0.016 per 1,000 characters. In Mistral’s own evaluations, the model achieves a 68.4% win rate against ElevenLabs Flash v2.5 in multilingual voice cloning.
  • Voxtral Transcribe 2 transcribes speech in real time or in batch mode across 13 languages. Built-in speaker diarisation identifies who said what — without a separate service. Word-level timestamps allow for precise subtitle generation and audio search. API pricing: $0.003 per minute — a one-hour meeting costs $0.18 to transcribe.
  • Open model weights: both models are available for download on Hugging Face. Voxtral TTS is released under CC BY-NC 4.0; the real-time transcription model under Apache 2.0. Self-hosted deployment requires a GPU with at least 16 GB of VRAM (TTS model in BF16 format); with quantization, the model runs in approximately 3 GB of RAM.
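
To make the API prices above tangible, here is a minimal cost estimator. It is only a sketch: the two constants are the prices quoted in this article, and the function names are ours, not part of any Mistral SDK. Check the current price list before budgeting.

```python
from decimal import Decimal

# Prices as quoted in this article (USD); they may change over time.
TTS_USD_PER_1K_CHARS = Decimal("0.016")
STT_USD_PER_MINUTE = Decimal("0.003")


def tts_cost(characters: int) -> Decimal:
    """Estimated API cost of synthesising `characters` characters of text."""
    return TTS_USD_PER_1K_CHARS * characters / 1000


def transcription_cost(minutes: int) -> Decimal:
    """Estimated API cost of transcribing `minutes` minutes of audio."""
    return STT_USD_PER_MINUTE * minutes


if __name__ == "__main__":
    print("One-hour meeting:", transcription_cost(60), "USD")
    print("10,000 characters of speech:", tts_cost(10_000), "USD")
```

Decimal arithmetic is used instead of floats so that small per-unit prices add up without rounding noise.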

Why open-weight matters for data protection

The crucial difference from proprietary providers such as ElevenLabs or OpenAI TTS lies in the operating model: whoever deploys Voxtral locally sends no audio data to external servers. For companies in regulated industries — healthcare, finance, law, or wherever confidential conversations are processed — this is not a nice-to-have but a prerequisite. As a European company, Mistral also brings a different regulatory baseline than US providers. A structured AI integration can map exactly which data flows are permissible for your use case and whether local operation is necessary.

One licence nuance is worth noting: the CC BY-NC 4.0 licence for Voxtral TTS model weights excludes commercial use in self-hosted deployments. Companies that want to use Voxtral TTS commercially on their own infrastructure need to clarify terms with Mistral directly. The API offering has no such restriction. Voxtral Realtime for transcription, available under Apache 2.0, permits commercial use freely.

Practical use cases for mid-sized companies

Voice AI is no longer a topic reserved for large corporates. With Voxtral, it becomes accessible for mid-sized companies without an in-house AI team. The following use cases are realistic starting points for a pilot:

  • Automatic meeting minutes: Voxtral Transcribe 2 records a conversation, identifies speakers automatically and delivers a structured transcript without manual effort. At $0.003 per minute, a ten-hour weekly meeting load costs less than $2.
  • Voice output in software applications: product descriptions, help texts or status notifications can be converted by Voxtral TTS into natural speech without producing individual voice recordings. Relevant for web portals, e-commerce platforms and internal tools.
  • Accessibility: the EU Web Accessibility Directive and EN 301 549 require digital accessibility for many business applications. Automatic speech output addresses these requirements technically, without separate audio production.
  • Customer service support: combined with a conversation agent, Voxtral can handle incoming telephone queries as a first-contact layer — a complement to human staff, not a replacement. This is where the low latency of 90 ms time-to-first-audio matters.
  • Documentation and knowledge capture: sales calls, support cases and technical briefings can all be transcribed and summarised automatically, feeding into CRM or knowledge management systems.
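
As an illustration of the meeting-minutes use case, the sketch below turns diarised transcript segments into readable minutes by merging consecutive segments from the same speaker. The `Segment` fields (speaker, start, end, text) are a hypothetical shape chosen for this example; the actual Voxtral response schema may differ and would need to be mapped onto it.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape of one diarised segment, not the official schema.
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def merge_speaker_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn instead of starting a new one.
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


def format_minutes(turns: list[Segment]) -> str:
    """Render each turn as a '[MM:SS] speaker: text' line."""
    lines = []
    for turn in turns:
        minutes, seconds = divmod(int(turn.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {turn.speaker}: {turn.text}")
    return "\n".join(lines)
```

The resulting text can be passed to a summarisation step or stored in a CRM note; how speakers are labelled (names versus "Speaker 1") depends on what the transcription service returns.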

Voice AI no longer requires a six-figure budget or a US cloud contract. With Voxtral, it can be evaluated on existing hardware and, once the test succeeds, built into your own custom software solutions.

How to integrate Voxtral and what to check beforehand

Voxtral can be accessed in two ways: via the Mistral API (no hardware required, pay-as-you-go) or self-hosted on your own GPU infrastructure. Both approaches can be embedded in existing systems via standard REST calls; the Mistral documentation covers Python, Node.js and direct HTTP integration. Before a production deployment, a few questions are worth answering with qualified IT consulting:

  • Check data protection first: may audio data from your use case (meeting recordings, customer calls) legally pass through an external API? Many company policies or sector regulations answer this clearly.
  • Start with the API: the Mistral API has no setup cost and lets you test voice quality and transcription accuracy against your own content before any hardware decision.
  • Validate the licence: CC BY-NC 4.0 for Voxtral TTS weights means self-hosted commercial use requires a separate agreement with Mistral. Apache 2.0 for Voxtral Realtime (transcription) has no such restriction.
  • Plan the integration path: Voxtral outputs standard audio (TTS) and JSON transcripts (STT). Both integrate cleanly into most web applications, custom software solutions and workflow tools without custom middleware.
  • Consider the European angle: Mistral as a French company, deployable on EU infrastructure, gives a different compliance baseline than US-headquartered providers — worth documenting for audits under the EU AI Act.
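
The data protection and licence checks from the list above can be written down as a small decision helper. This is a sketch that encodes only the rules stated in this article (CC BY-NC 4.0 for the TTS weights, Apache 2.0 for the real-time transcription weights); the enum values and function name are illustrative, and the result is a starting point for discussion, not a legal assessment.

```python
from enum import Enum


class Model(Enum):
    # Illustrative identifiers, not official model names.
    TTS = "tts"                      # weights: CC BY-NC 4.0 (per this article)
    TRANSCRIPTION = "transcription"  # weights: Apache 2.0 (per this article)


def deployment_options(model: Model, data_may_leave_premises: bool,
                       commercial_use: bool) -> list[str]:
    """List the deployment routes that remain open for a given use case."""
    options = []
    # The hosted API is only an option if policy allows external processing.
    if data_may_leave_premises:
        options.append("Mistral API")
    # Self-hosting is always technically possible, but commercial use of the
    # TTS weights needs separate terms because of the non-commercial licence.
    if model is Model.TTS and commercial_use:
        options.append("self-hosted (separate agreement with Mistral required)")
    else:
        options.append("self-hosted")
    return options
```

For example, a commercial TTS use case whose data must stay in-house leaves only the self-hosted route with a separate agreement, whereas commercial transcription can use either route as long as policy permits external processing.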
