
The state of modern AI text to speech systems for screen reader users


If you're not a screen reader user yourself, you might be surprised to learn that the text to speech technology used by most blind people hasn't changed in the last 30 years. While text to speech has taken the sighted world by storm, in everything from personal assistants to GPS to telephone systems, the voices used by blind folks have remained mostly static. This is largely intentional. The needs of a blind text to speech user are vastly different from those of a sighted user. While sighted users prefer voices that are natural, conversational, and as human-like as possible, blind users tend to prefer voices that are fast, clear, predictable, and efficient. This results in a preference among blind users for voices that sound somewhat robotic, but can be understood at high rates of speed, often upwards of 800 to 900 words per minute. For comparison, the speaking rate of an average person hovers around 200 to 250 words per minute.

Unfortunately, this difference in needs has resulted in blind people being left out of the explosion of text to speech advancement, and has caused many problems. First, the voice preferred by the majority of Western English-speaking blind users, called Eloquence, was last updated in 2003. It is so overwhelmingly popular that even Apple was eventually pressured into adding the voice to the iPhone, Mac, Apple TV, and Apple Watch, but even they were forced to use an emulation layer. As Eloquence is a 32-bit voice last compiled in 2003, it cannot run in modern software without some sort of emulation or bridge. If the source code to Eloquence still exists, even large companies like Apple haven't managed to find or compile it. As the NVDA screen reader moves from being a 32-bit application to a 64-bit one, keeping Eloquence running with it has been a challenge that I and many other community members have spent a lot of time and effort solving. The Eloquence libraries also have many known security issues, and anyone using the libraries today is forced to understand and program around them, as Eloquence itself can never be updated or fixed. These stopgap solutions are untenable, and are likely to take us only so far. A better solution is urgently needed.

The second problem this has caused is for those who speak languages other than English. As most modern text to speech voices are created by and for sighted users, blind users find that the voices available in less popular languages are inefficient, overly conversational, slow, and otherwise unsatisfactory. While espeak-ng is an open-source text to speech system that attempts to support hundreds of languages while meeting the needs of blind users, it brings a different set of problems to the table. First, many of the languages it supports were added based on pronunciation rules taken from Wikipedia articles, without involving speakers of the language. Second, espeak-ng is based directly on Speak, a text to speech system written by Jonathan Duddington in 1995 for Acorn computers running RISC OS, meaning that espeak users today continue to live with many of the design decisions made back in 1995 for an operating system that no longer exists. Third, looking at the espeak-ng repository, it seems to have only one or two active maintainers. While this is obviously better than the zero active maintainers of Eloquence, it could still become a problem in the future.

These are the reasons that I'm always interested in advancements in text to speech, and am actively keeping my ears open for something that takes advantage of modern technology, while continuing to suit the needs of screen reader users like myself.

Over the holiday break, I decided to take a look at two modern AI-based text to speech systems and see if they could be added to NVDA. I chose these two models because they advertised themselves as fast, responsive, and able to run without a GPU. The first was Supertonic, and the second was Kitten TTS. As both models require 64-bit Python, I wrote the addons for the 64-bit alpha of NVDA. However, other than making development easier, this had little effect on the results.
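
To give a sense of what this integration work involves, here's a rough sketch of the shape an NVDA synth driver takes. The names imported from synthDriverHandler are NVDA's own; the model loading, synthesis, and playback calls are hypothetical stand-ins, not the actual Supertonic or Kitten TTS APIs.

    # Rough sketch of an NVDA synth driver. load_model, synthesize, and play are
    # hypothetical placeholders for whatever the bundled TTS model actually exposes.
    from synthDriverHandler import SynthDriver as _BaseSynthDriver, synthDoneSpeaking


    class SynthDriver(_BaseSynthDriver):
        # NVDA discovers drivers by this class name inside the synthDrivers package.
        name = "aiTtsSketch"
        description = "AI TTS (sketch)"

        @classmethod
        def check(cls):
            # A real driver would verify that the bundled model files are present.
            return True

        def __init__(self):
            super().__init__()
            self._model = load_model("model.onnx")  # hypothetical model loader

        def speak(self, speechSequence):
            # NVDA passes a mix of text strings and command objects; a real driver
            # must honour index, pitch, and language commands, not just the text.
            text = " ".join(item for item in speechSequence if isinstance(item, str))
            audio = self._model.synthesize(text)  # hypothetical: blocks until done
            play(audio)  # hypothetical playback, e.g. through NVDA's nvwave module
            synthDoneSpeaking.notify(synth=self)

        def cancel(self):
            # Screen reader users interrupt speech constantly; both generation and
            # playback need to stop here, immediately.
            pass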

Unfortunately, doing this work uncovered a number of issues that I believe are common to all of the modern AI-based text to speech systems, and that make them unsuitable for use in screen readers. The first issue is dependency bloat. In order to bundle these systems as NVDA addons, developers are required to include a vast multitude of large and complex Python packages: around 103 for Kitten TTS, and just over 30 for Supertonic. As the standard building and packaging methods for NVDA addons do not support specifying and building requirements, these dependencies need to be manually copied over, included in any GitHub repositories, and cannot be automatically updated. Loading all of these dependencies directly into NVDA also makes the screen reader slower to start, increases its resource usage, and exposes NVDA users to any security issue in any of these libraries. As a screen reader needs access to the entire system, this is far from ideal.
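
To give an idea of what that manual copying looks like in practice, here's roughly the pattern such an addon ends up using. The paths and package names below are illustrative, not the exact layout of my addons.

    # Vendored-dependency pattern, with illustrative paths and package names.
    # At build time the requirements are copied into the addon by hand, e.g.:
    #   pip install --target addon/synthDrivers/aiTtsSketch/lib kittentts onnxruntime
    # At run time the addon prepends that folder to sys.path before importing:
    import os
    import sys

    _LIB_DIR = os.path.join(os.path.dirname(__file__), "lib")
    if _LIB_DIR not in sys.path:
        sys.path.insert(0, _LIB_DIR)

    # Every vendored package is now loaded into the screen reader's own process,
    # with all the startup cost and security exposure that implies.
    import numpy  # noqa: E402 -- example vendored dependency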

The second issue is accuracy. These modern systems are developed to sound human, natural, and conversational. Unfortunately, this seems to come at the expense of accuracy. In my testing, both models had a tendency to skip words, read numbers incorrectly, chop off short utterances, and ignore prosody hints from text punctuation. Kitten TTS is slightly better here, as it uses a deterministic phonemizer (the same one used by espeak, actually) to determine the correct way to pronounce words, leaving only the generation of the speech itself up to AI. Nevertheless, Kitten TTS is still far from perfectly accurate. When it comes to use in a screen reader, skipping words or reading numbers incorrectly is unacceptable.
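
To illustrate what a deterministic front end buys you, here's a small sketch using the open-source phonemizer package with its espeak backend. I'm using that package purely for illustration; Kitten TTS's actual front end may differ in its details.

    # Sketch of a deterministic phonemizer stage.
    from phonemizer import phonemize

    text = "Read 1,234 new messages."
    phones = phonemize(text, language="en-us", backend="espeak", strip=True)
    print(phones)
    # The phoneme string comes from fixed rules, so numbers and words are expanded
    # the same way every time; only the audio generation is left to the neural model.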

The third issue is speed. Supertonic has the edge here, but even it is far too slow. Unlike older text to speech systems, Supertonic and Kitten TTS cannot begin generating speech until they have an entire chunk of text. Supertonic is slightly faster, as it can stream the resulting audio as it becomes available, whereas Kitten TTS cannot start speaking until all of the audio for the chunk is fully generated. But for use in a screen reader, a text to speech system needs to begin generating speech as quickly as possible, rather than waiting for an entire phrase or sentence. Screen reader users jump through text quickly and frequently interrupt the screen reader, and thus require the text to speech system to be able to quickly discard and restart speech.
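
To make that requirement concrete, here's a rough sketch of the interaction pattern a screen reader needs from a synthesizer: generate audio in small pieces, start playing as soon as the first piece is ready, and abandon the rest the instant the user interrupts. The synthesize_chunk and player objects are hypothetical placeholders, not real Supertonic or Kitten TTS calls.

    # Sketch of a screen-reader-friendly speech loop with immediate cancellation.
    import threading


    class Speaker:
        def __init__(self, model, player):
            self._model = model
            self._player = player
            self._cancelled = threading.Event()

        def speak(self, text):
            self._cancelled.clear()
            threading.Thread(target=self._run, args=(text,), daemon=True).start()

        def _run(self, text):
            # Short pieces keep the time to first audio low.
            for piece in text.split(". "):
                if self._cancelled.is_set():
                    return  # the user has moved on; throw the rest of the work away
                audio = self._model.synthesize_chunk(piece)
                if self._cancelled.is_set():
                    return
                self._player.feed(audio)

        def cancel(self):
            # Called every time a keypress interrupts speech.
            self._cancelled.set()
            self._player.stop()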

The fourth and final issue is control. Older text to speech systems make it easy to change the pitch, speed, volume, breathiness, roughness, head size, and other parameters of the voice. This allows screen reader users to customize the voice to our exact needs, as well as offering the ability to change the characteristics of the voice in real time based on the formatting or other attributes of the text. AI text to speech models, being trained on data from a particular set of speakers, cannot offer this customization. Instead, they inherit the speaking speed, pitch, volume, and other characteristics that were present in the training data. Kitten TTS and Supertonic both offer basic speed control, but it is highly variable from voice to voice and utterance to utterance. This leads to a loss of functionality that many blind users depend on.
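
To show what that real-time control looks like, here's an illustrative example of the kind of speech sequence NVDA can hand a synth driver. The command classes come from NVDA's speech.commands module; the sequence itself is made up for the example.

    # Illustrative NVDA speech sequence mixing text with prosody commands.
    from speech.commands import PitchCommand, RateCommand

    speechSequence = [
        RateCommand(offset=20),   # e.g. a user rule: read this passage faster
        "the heading ",
        PitchCommand(offset=30),  # raise the pitch for the capitalised word...
        "RESULTS",
        PitchCommand(),           # ...then fall back to the base pitch
        " follows below",
    ]

    # A classic synthesizer applies each command instantly, mid-utterance. A neural
    # model that only accepts plain text and a single global speed value has nowhere
    # to put this information.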

If you'd like to experience these issues for yourself, feel free to follow the links above to my GitHub repositories, which offer ready-to-install addons for the 64-bit NVDA alphas.
