Google AI Isolates a Single Voice from a Mixture of Sounds


Microphone with colorful sound waves depicting different voices in the background. Photo by Stux via Pixabay


Google researchers have recently developed artificial intelligence that can pick out a single voice in a crowd simply by focusing on a person’s face as they speak. The technology could boost audio quality in hearing aids and in video chat services such as Duo and Hangouts.

In a blog post, Google Research software engineers Inbar Mosseri and Oran Lang explained that people are good at focusing their attention on a particular speaker in a noisy environment, mentally muting all other sounds and voices. This capability is known as the cocktail party effect. While it comes naturally to humans, it remains a challenge for computers, notably in automatic speech separation, the task of separating an audio signal into its individual speech sources.

Inspired by the cocktail party effect, the team presented a deep learning audio-visual model that can isolate a single speech signal from a mixture of sounds, including background noise and other voices. The engineers reported that their team was able to computationally produce videos in which the speech of selected people is enhanced while all other sounds are suppressed.
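In broad strokes, systems of this kind predict a time-frequency mask for the chosen speaker and multiply it with the spectrogram of the noisy mixture, keeping only that speaker's energy. The toy sketch below illustrates the masking idea with an "ideal" mask computed from known sources; it is a simplified illustration of the general approach, not Google's actual model, which would predict the mask from combined audio and face embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend magnitude spectrograms (time x frequency) for two speakers.
speaker_a = rng.random((100, 257))
speaker_b = rng.random((100, 257))

# Simplification: assume the mixture is additive in the magnitude domain.
mixture = speaker_a + speaker_b

# "Ideal ratio mask" for speaker A: the fraction of energy A contributes
# in each time-frequency bin. A real separation network would *predict*
# this mask; here we compute it from the known sources for illustration.
mask_a = speaker_a / (speaker_a + speaker_b + 1e-8)

# Applying the mask to the mixture recovers an estimate of speaker A.
estimate_a = mask_a * mixture

error = np.abs(estimate_a - speaker_a).mean()
print(f"mean reconstruction error: {error:.6f}")
```

With an ideal mask the reconstruction error is negligible; the hard part, which the deep network handles, is predicting a good mask when only the mixture and the video frames are available.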

“All that is required from the user is to select the face of the person in the video they want to hear,” the engineers wrote. They also pointed out that the technology works even on ordinary videos containing a single audio track. The team highlighted that the new AI can be applied to speech recognition in videos and speech enhancement in video conferencing, and could improve hearing aids in situations where multiple people are speaking at the same time.