Disclosure: Amplify is an investor in Gradium.
If AI research is Star Wars and OpenAI is the Death Star, then without a doubt the rebels are building audio models. The best models for voice – text-to-speech (TTS), speech-to-speech (STS), speech-to-text (STT), and the like – are not coming from the big labs. Instead, they’re built by their underfunded, understaffed, and underhyped siblings: a wave of incredible startups that is improbably crushing benchmarks with every model release. And if you believe that audio is the biggest future modality for AI – as many researchers do – this is one of the more interesting and underdiscussed topics in genAI today.
One of these improbably cutting-edge startups is Gradium, born out of the open research lab Kyutai.
In the summer of 2024, on a stage in Paris, a Kyutai researcher (his name is Neil) demoed the first real-time audio conversation with AI. The model, Moshi, could respond in real time, change its voice style and volume on request, and even recite an original poem in a French accent (research shows poems sound better this way).
You’ve probably seen audio AI demos before. You may not be particularly impressed. Didn’t OpenAI do this a few years ago? Well, not exactly:
This was the first full-duplex conversational AI model. Unlike half-duplex systems, which wait for you to finish speaking before they respond, Moshi listened and spoke on a single continuous stream: it could interrupt, be interrupted, backchannel ("uh-huh", "I see"), and respond in around 160ms – faster than the typical gap in human conversation. This demo happened before OpenAI released Advanced Voice Mode, and a full year before xAI released a similar demo (with more latency).
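To make "full-duplex" concrete, here is a minimal conceptual sketch – not Moshi's actual implementation, and every name in it is hypothetical – of the difference from a turn-taking pipeline. A half-duplex assistant runs speech-to-text, then an LLM, then text-to-speech once your turn ends; a full-duplex model ingests audio and decides whether to emit audio on every single frame, which is what makes interruption and backchanneling possible:

```python
import asyncio

FRAME_SECONDS = 0.08  # assume ~80ms audio frames for illustration

async def next_mic_frame() -> bytes:
    """Pull the next frame from the microphone (stubbed with silence)."""
    await asyncio.sleep(FRAME_SECONDS)
    return b"\x00" * 1280

async def play_frame(frame: bytes) -> None:
    """Push a generated frame to the speaker (stubbed)."""
    await asyncio.sleep(0)

class FullDuplexAgent:
    """Hypothetical agent: one step per frame. Each step updates internal
    state with the user's audio and decides whether to speak this frame
    (full speech, a backchannel, or silence). Because listening and
    speaking happen in the same loop, the model can interrupt and be
    interrupted mid-sentence."""

    def step(self, incoming: bytes) -> bytes | None:
        # A real model would run its network here; this stub stays silent.
        return None

async def run(agent: FullDuplexAgent, seconds: float = 1.0) -> None:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + seconds
    while loop.time() < deadline:
        incoming = await next_mic_frame()  # always listening...
        outgoing = agent.step(incoming)    # ...always deciding...
        if outgoing is not None:
            await play_frame(outgoing)     # ...speaking only when chosen

if __name__ == "__main__":
    asyncio.run(run(FullDuplexAgent()))
```

The per-frame loop is also where the latency figure comes from: with ~80ms frames, a model that reacts within a frame or two of your last word lands in the ~160ms range quoted above, rather than waiting for an end-of-turn detector to fire.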
This would have been a groundbreaking release from a major lab – except it wasn’t from a major lab. It was from a team of 4 (four) researchers who built it completely from scratch (without a pre-trained base) in 6 months. The model is open source and can even run on mobile. Oh, and the team was part of a non-profit with extremely limited funding. How did they do it?
Based on extensive interviews with the Gradium team, this post goes into technical depth on an incredibly interesting niche of the increasingly top-heavy AI world:
A brief history of audio ML, and why it’s consistently overlooked
Dynamics of big labs, and why small teams of researchers can outperform them