
Exploring JEPA for real-time speech translation

Why This Matters

The development of JEPA-v0 highlights the importance of advanced audio encoders in enabling more natural and emotionally expressive real-time speech translation. By preserving paralinguistic cues like pitch and rhythm, these systems can significantly improve communication clarity and emotional nuance across languages, transforming global interactions for consumers and the tech industry alike.

Why audio encoders matter

Imagine you’re on a video call with a colleague in Tokyo. She’s explaining, with rising excitement in her voice, why a project deadline needs to move. You don’t speak Japanese, so a translation system kicks in. That system transcribes her speech to text, translates the text, and reads it back with a robotic voice. You might get the words, but you lose the urgency, the sentiment, the person. That’s the fundamental problem with traditional cascaded translation pipelines. The ASR → MT → TTS cascade treats audio as a container for text. It cracks open the container, throws the container away, translates the text inside, and builds a new container from scratch. Everything that made the original speech human, like pitch and rhythm, gets discarded at step one and cannot be naturally recovered.
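To make the information loss concrete, here is a minimal sketch of the cascade described above, using hypothetical stub functions (`asr`, `mt`, `tts` are illustrative stand-ins, not real APIs). The type of each stage's interface shows exactly where paralinguistic cues disappear: `asr` returns only a string, so pitch and rhythm never reach the later stages.

```python
def asr(audio: dict) -> str:
    # Speech recognition keeps only the words; prosody is dropped right here.
    return audio["words"]

def mt(text: str) -> str:
    # Text-to-text translation sees nothing but the transcript.
    return f"<translated>{text}</translated>"

def tts(text: str) -> dict:
    # Synthesis must invent prosody from scratch; the original is gone.
    return {"words": text, "pitch": "default", "rhythm": "default"}

source = {"words": "the deadline must move", "pitch": "rising", "rhythm": "urgent"}
output = tts(mt(asr(source)))
# The translated words survive, but the speaker's pitch and rhythm do not.
```

However sophisticated the individual stages, the string passed between them is the bottleneck: anything not expressible as text cannot cross it.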

A better approach works directly with rich audio representations that carry both what was said and how it was said. Systems like Meta’s SeamlessStreaming and Kyutai’s Hibiki point in this direction: encode the source speech into a representation that preserves meaning alongside paralinguistic information, then decode that representation into the target language while keeping the speaker’s characteristics intact.

A core component for the speech-to-speech approach is the audio encoder, which takes raw audio and compresses it into a useful representation. If the encoder strips out emotion and prosody (like a transcription model does), no downstream translator can recover them. If the encoder preserves them, everything downstream has a better chance. The quality of this encoder determines the ceiling of everything downstream.

That’s what JEPA-v0 is: our attempt to build an audio encoder whose representations are rich enough to power real-time speech-to-speech translation.

Why self-supervised? The labeled data bottleneck

Consider what you’d need to train a supervised multilingual audio encoder from scratch. You’d want parallel speech data: the same utterance spoken in English, Portuguese, Japanese, Arabic, Mandarin, and dozens more languages. You’d want speaker labels, emotion annotations, prosody markers. Datasets like this barely exist, and where they do, they cover a handful of language pairs in controlled recording conditions.

Supervised speech encoders like Whisper sidestep this by training on a massive but narrow task: transcription. Whisper was trained on 680k hours of labeled speech-text pairs, and it learned excellent representations for turning speech into text. But its representations are optimized for lexical content, not for the paralinguistic features a translation system needs.

Self-supervised learning flips the script. Instead of telling the model what to learn from the data (transcribe this, classify that), you let the model discover structure on its own. The model learns from the raw audio itself without anyone labeling anything. This is the same insight that made BERT and GPT transformative for text: pre-train a general representation from unlabeled data, then let downstream models specialize.

For JEPA-v0, this means we can train on millions of audio samples spanning speech in multiple languages, environmental sounds, and music, all without needing a single label. The encoder learns a rich structure to represent the data: phonetic patterns, speaker characteristics, emotional valence, rhythmic structure, acoustic events.
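The general JEPA recipe can be sketched as masked prediction in latent space: hide a span of the input, encode the visible context, and regress the *representation* of the hidden span rather than its raw samples. The sketch below is our reading of that recipe under stated assumptions (random matrices stand in for the context encoder, target encoder, and predictor; nothing here is JEPA-v0's actual training code), and the point is only the shape of the objective: no labels appear anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))      # pretend frame features, unlabeled

mask = np.zeros(50, dtype=bool)
mask[20:30] = True                         # hide a contiguous span of frames

W_ctx = rng.standard_normal((8, 8)) * 0.1  # stand-in context encoder
W_tgt = rng.standard_normal((8, 8)) * 0.1  # stand-in target encoder
W_pred = rng.standard_normal((8, 8)) * 0.1 # stand-in predictor

context = frames[~mask] @ W_ctx            # encode only the visible frames
targets = frames[mask] @ W_tgt             # latent targets for hidden frames

# Predict each hidden frame's latent from a summary of the context.
pred = np.tile(context.mean(axis=0) @ W_pred, (mask.sum(), 1))
loss = np.mean((pred - targets) ** 2)      # regression in latent space
```

Predicting latents instead of raw audio is what lets the model spend capacity on structure (phonetics, prosody, speaker traits) rather than on reconstructing every sample of the waveform.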
