The new model, called VSSFlow, uses a single unified architecture to generate both sound effects and speech from silent video, with state-of-the-art results. Watch (and hear) some demos below.
The problem
Currently, most video-to-sound models (that is, models that are trained to generate sounds from silent videos) aren’t that great at generating speech. Likewise, most text-to-speech models fail at generating non-speech sounds, since they’re designed for a different purpose.
In addition, prior attempts to unify both tasks often assume that joint training degrades performance, so they train speech and sound generation in separate stages, which adds complexity to the pipeline.
Given this scenario, three Apple researchers, alongside six researchers from Renmin University of China, developed VSSFlow, a new AI model that can generate both sound effects and speech from silent video in a single system.
Not only that, but with the architecture they developed, training on speech improves sound generation and vice versa, rather than the two interfering with one another.
The solution
In a nutshell, VSSFlow draws on several established generative AI techniques, including converting transcripts into sequences of phoneme tokens, and flow matching, which we covered here: a training method that teaches the model to efficiently transform random noise into the desired audio signal.
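For readers curious what flow matching looks like in practice, here is a minimal, illustrative training step in PyTorch. It is a generic sketch of the technique, not VSSFlow's actual code; the network, latent dimensions, and hyperparameters are placeholders.

```python
# Generic flow-matching sketch (illustrative only, not VSSFlow's implementation).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy network that predicts the velocity field v(x_t, t)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by concatenating it to the input.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: VelocityNet, x1: torch.Tensor) -> torch.Tensor:
    """One flow-matching step: learn to move noise x0 toward data x1."""
    x0 = torch.randn_like(x1)            # random noise sample
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # point on the straight noise-to-data path
    target_velocity = x1 - x0            # constant velocity along that path
    pred_velocity = model(x_t, t)
    return ((pred_velocity - target_velocity) ** 2).mean()

# Usage: one training step on a batch of stand-in "audio latents".
model = VelocityNet(dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
latents = torch.randn(8, 64)             # placeholder for real audio latents
loss = flow_matching_loss(model, latents)
loss.backward()
opt.step()
```

At generation time, the trained velocity field is integrated from pure noise back to a clean signal in a handful of steps, which is what makes the approach efficient.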
All of that is embedded in a 10-layer architecture that blends the video and transcript signals directly into the audio generation process, allowing the model to handle both sound effects and speech within a single system.
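To make "blending the signals into the generation process" more concrete, here is a hedged sketch of one common way to do it: video and phoneme embeddings share a sequence with the audio tokens inside a transformer stack. The class name, dimensions, and this particular joint-attention scheme are assumptions for illustration, not VSSFlow's published design.

```python
# Hypothetical conditioning sketch: audio tokens attend to video and phoneme
# embeddings in a shared transformer stack. Not VSSFlow's actual architecture.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, dim: int = 256, layers: int = 10, heads: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, audio_tokens, video_tokens, phoneme_tokens):
        # Joint sequence: audio positions can attend to both conditioning streams.
        seq = torch.cat([video_tokens, phoneme_tokens, audio_tokens], dim=1)
        hidden = self.blocks(seq)
        # Keep only the positions corresponding to the audio tokens.
        audio_out = hidden[:, -audio_tokens.size(1):]
        return self.out(audio_out)

# Usage with dummy tensors (batch=2, model dim=256).
model = ConditionedDenoiser()
audio = torch.randn(2, 50, 256)      # noisy audio latents being denoised
video = torch.randn(2, 30, 256)      # per-frame video features
phonemes = torch.randn(2, 40, 256)   # embedded phoneme tokens from the transcript
update = model(audio, video, phonemes)  # predicted update for the audio tokens
```

Because the same stack sees both conditioning streams, the model can lean on the video for sound effects and on the phonemes for speech without needing two separate pipelines.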
Perhaps more interestingly, the researchers note that jointly training on speech and sound actually improved performance on both tasks, rather than causing the two to compete or degrade each other's results.