
Omnilingual ASR: Advancing automatic speech recognition for 1,600 languages

Automatic speech recognition (ASR) systems aim to make spoken language universally accessible by transcribing speech into text that can be searched, analyzed, and shared. Currently, however, most ASR systems cover only a small set of high-resource languages that are well represented on the internet, and they typically rely on large amounts of labeled data and human-generated metadata to achieve good performance. As a result, high-quality transcription is often unavailable to speakers of less widely represented, low-resource languages, widening the digital divide.

Today, Meta’s Fundamental AI Research (FAIR) team is introducing Omnilingual ASR — a groundbreaking suite of models that deliver automatic speech recognition for more than 1,600 languages, including 500 low-resource languages never before transcribed by AI. We’re also open sourcing Omnilingual wav2vec 2.0, a new self-supervised massively multilingual speech representation model scaled up to 7B parameters that can be leveraged for other downstream speech-related tasks. In addition, we’re releasing the Omnilingual ASR Corpus, a unique collection of transcribed speech in 350 underserved languages, curated in collaboration with our global partners.
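
For teams that want to build on the released speech representations, the workflow resembles that of any wav2vec 2.0-style encoder: feed in 16 kHz audio and take the frame-level hidden states as features for a downstream task. The sketch below illustrates this pattern using the Hugging Face transformers wav2vec 2.0 API as a stand-in; the checkpoint ID and audio file name are placeholders, and the actual Omnilingual ASR release may ship with its own loading interface.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder checkpoint for illustration only; this is NOT the official
# Omnilingual wav2vec 2.0 model ID. Swap in the released checkpoint.
MODEL_ID = "facebook/wav2vec2-base"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
encoder = Wav2Vec2Model.from_pretrained(MODEL_ID)
encoder.eval()

# wav2vec 2.0-style encoders expect mono 16 kHz audio.
waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = feature_extractor(
    waveform.mean(dim=0).numpy(),  # downmix multichannel audio to mono
    sampling_rate=16_000,
    return_tensors="pt",
)

with torch.no_grad():
    # Frame-level representations, usable as features for downstream
    # speech tasks such as ASR fine-tuning or language identification.
    features = encoder(**inputs).last_hidden_state  # shape: (1, n_frames, hidden_dim)

print(features.shape)
```

The same self-supervised-encoder pattern is what makes a model like Omnilingual wav2vec 2.0 reusable beyond transcription: the encoder is trained once on unlabeled multilingual speech, and downstream tasks attach lightweight heads to its representations.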

This work supports our goal of building technology to help bring the world closer together. Omnilingual ASR is a significant step toward delivering a truly universal transcription system and expanding access to speech technology worldwide, ensuring that high-quality speech-to-text systems are accessible to even the most underrepresented language communities. The hope is to ultimately break down language barriers and enable communication across diverse linguistic and cultural backgrounds.