TL;DR: Mandarin pronunciation has been hard for me, so I took ~300 hours of transcribed speech and trained a small CTC model to grade my pronunciation. You can try it here.
In my previous post about Langseed, I introduced a platform for defining words using only vocabulary I had already mastered. My vocabulary has grown since then, but unfortunately, people still struggle to understand what I'm saying.
Part of the problem is tones. They're fairly foreign to me, and I'm bad at hearing my own mistakes, which is deeply frustrating when you don’t have a teacher.
First attempt: pitch visualisation
My initial plan was to build a pitch visualiser: split incoming audio into small chunks, run an FFT, extract the dominant pitch over time, and map it using an energy-based heuristic, loosely inspired by Praat.
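As a rough illustration of that pipeline (not the original code), the sketch below frames the audio, skips frames whose energy falls below a threshold, runs an FFT on the rest, and reports the strongest frequency in a plausible speech range. The function names, the Hann window, the 60–500 Hz search band, and the use of mean energy as a simple voicing gate are all assumptions made for this example. The FFT is hand-rolled so the block has no dependencies; a real implementation would more likely use a library routine.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def pitch_track(samples, sample_rate, frame_size=1024, hop=512,
                energy_threshold=0.01, fmin=60.0, fmax=500.0):
    """Return one dominant-frequency estimate (Hz) per frame.

    Frames whose mean energy is below energy_threshold yield None.
    All parameter defaults are illustrative assumptions.
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
              for i in range(frame_size)]
    # Restrict the peak search to FFT bins inside [fmin, fmax].
    lo = max(1, int(fmin * frame_size / sample_rate))
    hi = min(frame_size // 2, int(fmax * frame_size / sample_rate))

    track = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        if energy < energy_threshold:
            track.append(None)  # treated as silence / unvoiced
            continue
        spectrum = fft([s * w for s, w in zip(frame, window)])
        peak_bin = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        track.append(peak_bin * sample_rate / frame_size)
    return track


if __name__ == "__main__":
    rate = 16000
    # Half a second of silence followed by half a second of a 220 Hz tone.
    signal = [0.0] * (rate // 2) + [
        0.5 * math.sin(2 * math.pi * 220.0 * t / rate) for t in range(rate // 2)
    ]
    print(pitch_track(signal, rate))
```

With a 1024-sample frame at 16 kHz, each FFT bin is about 15.6 Hz wide, so the estimate is coarse. The strongest spectral peak is also not always the fundamental: a harmonic can dominate, which makes the estimate jump by an octave.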
But this approach quickly became brittle. There were endless special cases: background noise, coarticulation, speaker variation, voicing transitions, and so on.
And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.
So instead, I decided to build a deep learning–based Computer-Assisted Pronunciation Training (CAPT) system that could run entirely on-device. There are already commercial APIs that do this, but hey, where’s the fun in that?
Architecture