
Neural audio codecs: how to get audio into LLMs

Václav Volhejn

Thank you for the valuable feedback on the drafts: Chung-Ming Chien, Moritz Boehle, Richard Hladík, Eugene Kharitonov, Patrick Perez, and Tom Sláma. I’d also like to thank the rest of the Kyutai team for the research discussions without which this article could not exist.

The plan: sandwich a language model in an audio encoder/decoder pair (= a neural audio codec), allowing it to predict audio continuations.

As of October 2025, speech LLMs suck. Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (see Unmute), but it’s a wrapper, not real speech understanding. The model can’t hear the frustration in your voice and respond with empathy, can’t emphasize important words in its answer, can’t sense sarcasm, and so on.

Yes, there are LLMs (Gemini, ChatGPT’s Advanced Voice Mode, Qwen, Moshi) that understand and generate speech natively. But in practice, they’re either not as smart, or they behave like text model wrappers. Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

Clearly, speech LLMs lag behind text LLMs. But why? For text, we found out a few years ago that if you take a lot of text data, a big Transformer, and a lot of GPUs, you’ll get some pretty damn good text continuation models. Why can’t we just replace text with audio and get pretty damn good speech continuation models?

As a teaser, here’s what happens when you try to do that naively (warning, loud):

[Audio sample]

We’ll have a look at why audio is harder to model than text and how we can make it easier with neural audio codecs, the de facto standard way of getting audio into and out of LLMs. With a codec, we can compress audio into a compact sequence of discrete tokens, train models to predict continuations of these tokens, and then decode those continuations back into audio: see the animation above.
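To make the shape of that pipeline concrete, here is a minimal, runnable Python sketch. ToyCodec and ToyLM are illustrative stand-ins I made up for this post, not Mimi’s or Moshi’s API: a real system would swap in a neural codec for encode/decode and a Transformer trained on codec tokens for the continuation step.

```python
import numpy as np

# Toy stand-ins for the real components. The class and method names here are
# hypothetical; a real system would use a neural codec (e.g. Mimi) and an
# autoregressive Transformer over the codec's tokens.

class ToyCodec:
    """Maps waveform frames to discrete tokens and back (very lossy toy)."""
    def __init__(self, frame_size=480, n_tokens=256):
        self.frame_size = frame_size      # 480 samples = 20 ms at 24 kHz
        self.n_tokens = n_tokens

    def encode(self, wav: np.ndarray) -> np.ndarray:
        # One token per frame: crudely quantize each frame's mean amplitude.
        n_frames = len(wav) // self.frame_size
        frames = wav[: n_frames * self.frame_size].reshape(n_frames, self.frame_size)
        levels = (frames.mean(axis=1) + 1.0) / 2.0           # map [-1, 1] -> [0, 1]
        return np.clip((levels * self.n_tokens).astype(int), 0, self.n_tokens - 1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        levels = tokens / self.n_tokens * 2.0 - 1.0          # back to [-1, 1]
        return np.repeat(levels, self.frame_size)            # hold each level for one frame

class ToyLM:
    """Stands in for an autoregressive language model over codec tokens."""
    def generate(self, prompt_tokens: np.ndarray, n_new: int) -> np.ndarray:
        # Dumb continuation: repeat the last token. A real model would predict
        # each next token from the full token history.
        return np.full(n_new, prompt_tokens[-1], dtype=prompt_tokens.dtype)

# The sandwich: audio -> tokens -> LM continuation -> audio.
sample_rate = 24_000
t = np.arange(sample_rate) / sample_rate
prompt_audio = 0.5 * np.sin(2 * np.pi * 220 * t)             # 1 s of a 220 Hz tone

codec, lm = ToyCodec(), ToyLM()
prompt_tokens = codec.encode(prompt_audio)                   # 50 tokens for 1 s of audio
continuation_tokens = lm.generate(prompt_tokens, n_new=50)   # predict 1 more second
continuation_audio = codec.decode(continuation_tokens)
print(prompt_tokens.shape, continuation_audio.shape)         # (50,) (24000,)
```

The point of the sketch is only the interface: the language model never sees raw samples, just a short discrete token sequence it can treat the same way a text LLM treats text tokens.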

Kyutai folks have done a lot of work in this space, which is part of the reason I chose to cover this topic. We’ll start from the basics and build up all the way to Mimi, our neural audio codec. It was originally developed for Moshi and later adopted by others for their models, notably Sesame’s CSM.
