VFH Episode 45
In this episode, Teri welcomes Rupal Patel, the founder and CEO of VocaliD, a voice AI company that creates custom synthetic voice personalities so that brands and individuals can be heard as themselves.
Rupal is also a professor at Northeastern University in Boston and has a background in speech science. VocaliD creates synthetic voices from voice recordings of real people. The voices are so lifelike that the technology is used to give a voice to people who don’t have one, or who are about to lose theirs.
Key points from Rupal!
- Learning how people produce speech and using that basic science to develop new technologies that help people with impaired speech learn to speak and cope with their speech disorders.
Transitioning into Voice Technology
- At Northeastern University, she works in speech and hearing sciences.
- Her earlier work was in developing assistive technologies, but in recent years she has also been working on learning technologies.
The Origin of VocaliD
- It’s a project she started in her lab in 2007. It brought together the basic speech science and the design of assistive technology. Her lab at Northeastern University is called the Communication Analysis and Design Laboratory. She’s been on leave from the university for the last few years.
- VocaliD started with an observation: even people whose severe neurological disorders kept them from speaking clearly could often still control certain aspects of their voice (its prosody).
- She found that most people with speech impairment had to use assistive technologies to communicate because their speech wasn’t clear enough for people unfamiliar with them to understand.
- Assistive technologies offered only a handful of voices, so Rupal saw an opportunity to develop customized synthetic voices to fit each individual.
VocaliD’s Current Work
- They moved out of the lab at Northeastern in 2015 and got government funding (from the National Science Foundation and the National Institutes of Health) to turn the laboratory-based science into commercial products.
- Between 2015 and 2017, they focused on getting the technology ready for use, integrating it with existing assistive technologies, and serving users of those technologies.
- They kept refining the technology and the custom voices they were creating, and by 2018 they started seeing interest from broader market applications (talking apps that didn’t want to sound like Alexa or Siri).
- They are now working with companies across a variety of verticals that are interested in creating a custom voice identity for their product or brand.
The Process of Creating a Custom Voice
- The process starts by recording a person’s speech. Originally, they created the synthetic voice by gluing together little bits of those recorded speech sounds (concatenative synthesis).
- They have now moved to parametric speech synthesis: after the voice recording, instead of gluing together little bits of speech sounds, an algorithm learns the patterns of how the individual speaks and then emulates them.
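To make the contrast concrete, here is a minimal Python sketch of the two approaches on toy audio. Everything in it is an illustrative assumption (sine tones standing in for recorded speech units, a single pitch parameter standing in for a learned model); it is not VocaliD’s pipeline, just the shape of the idea.

```python
# Toy contrast between concatenative and parametric synthesis.
# All signals are synthetic sine tones; unit names, sample rate, and the
# one-parameter "speaker model" are illustrative assumptions.
import numpy as np

SR = 16_000  # sample rate in Hz (assumed)

def tone(freq_hz, dur_s):
    """Generate a sine tone standing in for a recorded speech unit."""
    t = np.arange(int(SR * dur_s)) / SR
    return np.sin(2 * np.pi * freq_hz * t)

# --- Concatenative: glue stored recorded units together --------------------
unit_inventory = {          # snippets cut from a speaker's recordings
    "HH": tone(180, 0.08),
    "AH": tone(220, 0.12),
    "L":  tone(200, 0.10),
    "OW": tone(240, 0.15),
}

def synthesize_concatenative(phonemes):
    """Stitch stored waveform units end to end."""
    return np.concatenate([unit_inventory[p] for p in phonemes])

# --- Parametric: learn the speaker's pattern, then generate fresh audio ----
def fit_speaker_model(recordings):
    """'Learn' one illustrative parameter: the speaker's average pitch,
    estimated crudely from zero crossings of the toy recordings."""
    pitches = []
    for wav in recordings:
        crossings = np.sum(np.diff(np.signbit(wav)))
        pitches.append(crossings / 2 / (len(wav) / SR))
    return {"mean_pitch_hz": float(np.mean(pitches))}

def synthesize_parametric(model, phonemes):
    """Generate new audio from the learned parameters instead of
    reusing stored snippets."""
    return np.concatenate([tone(model["mean_pitch_hz"], 0.1) for _ in phonemes])

speaker_model = fit_speaker_model(list(unit_inventory.values()))
hello_a = synthesize_concatenative(["HH", "AH", "L", "OW"])
hello_b = synthesize_parametric(speaker_model, ["HH", "AH", "L", "OW"])
print(f"concatenative: {len(hello_a)} samples, "
      f"parametric: {len(hello_b)} samples, "
      f"learned pitch ~ {speaker_model['mean_pitch_hz']:.0f} Hz")
```

The practical difference is the same one described above: the concatenative voice can only say what can be assembled from stored snippets, while the parametric voice can generate anything from the learned parameters.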
The Human Voicebank
- This was an initiative they started when the company was still in its early startup stage.
- The people they were making customized voices for could still vocalize, so the company could capture some sound from them. The initial technique was to take whatever sound the individual could produce, find a surrogate voice donor who could record 5 to 7 hours of speech, and then blend the client’s sound sample with the donor’s voice (see the sketch after this list).
- They needed volunteers to donate their voices, and they built a massive dataset of speakers from around the world: so far, 26,000 people from 110 countries, ranging in age from 6 to 91, have contributed to the voicebank.
- That dataset is what enables them to create voices for those who can’t speak.
- While they don’t use any of the voicebank voices to create voices for enterprise clients, they built an easy-to-use online recording platform that anyone in the world can use to contribute their voice to the voicebank.
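Here is a minimal sketch of that blending idea, under heavy assumptions: it reduces the client’s residual vocalization to a single characteristic (average pitch, estimated from zero crossings) and crudely imposes it on the donor’s audio by resampling. A real voice-morphing system works on far richer representations.

```python
# Toy "voice blending": estimate one vocal characteristic from the client's
# residual sound and impose it on a donor recording. The pitch estimator
# and the resampling "shift" are illustrative assumptions only.
import numpy as np

SR = 16_000  # sample rate in Hz (assumed)

def estimate_pitch_hz(wav):
    """Crude pitch estimate from zero-crossing rate (toy signals only)."""
    crossings = np.sum(np.diff(np.signbit(wav)))
    return crossings / 2 / (len(wav) / SR)

def shift_pitch(wav, factor):
    """Crudely raise/lower pitch by resampling; note this also changes
    duration, which a real system would correct for."""
    idx = np.arange(0, len(wav), factor)
    return np.interp(idx, np.arange(len(wav)), wav)

def blend_voices(client_sample, donor_recording):
    """Move the donor's pitch toward the client's residual vocalization."""
    client_pitch = estimate_pitch_hz(client_sample)
    donor_pitch = estimate_pitch_hz(donor_recording)
    return shift_pitch(donor_recording, client_pitch / donor_pitch)

# Toy signals: the client can only sustain a short vowel-like tone;
# the donor provides a longer "recording" at a different pitch.
t_client = np.arange(SR // 2) / SR
t_donor = np.arange(SR * 2) / SR
client = np.sin(2 * np.pi * 250 * t_client)
donor = np.sin(2 * np.pi * 180 * t_donor)

blended = blend_voices(client, donor)
print(f"client ~ {estimate_pitch_hz(client):.0f} Hz, "
      f"donor ~ {estimate_pitch_hz(donor):.0f} Hz, "
      f"blended ~ {estimate_pitch_hz(blended):.0f} Hz")
```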
Use Case Stories
- The most powerful use cases are people who bank their voice because they are only days away from losing it. Prior to this technology, people had two options: using an electrolarynx or having a tracheostomy speaking valve fitted. VocaliD’s technology offers a better alternative.
- Some people use the technology during the first 3 or 4 months of recovery right after voice-related surgery, when they have no other way to communicate. As they progress through voice therapy, it gives them an option for communicating with those around them.
Security Applications
- In 2018, they were approached by a large national institution to test their voice authentication systems.
- With banks, for example, people can access their accounts using their voices. Authentication works by comparing the speaker’s voice to a pre-recorded, stored voiceprint.
- As speech synthesis gets better, it will become more difficult for both machines and people to tell the difference between a synthetic voice and a real one, so VocaliD has been creating tools that recognize synthetic voices, to ensure the technology is not misused in the future.
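As a rough illustration of the voiceprint comparison described above, here is a minimal Python sketch: it summarizes a sample as a coarse spectral profile and accepts the speaker when the cosine similarity to the enrolled print clears a threshold. The embedding, signals, and threshold are all toy assumptions; production systems use learned speaker embeddings and calibrated thresholds, and synthetic-voice detection adds a separate classifier on top of this kind of comparison.

```python
# Toy voiceprint authentication: coarse spectral embedding + cosine match.
# The embedding, the sine-tone "speakers", and the threshold are assumptions.
import numpy as np

SR = 16_000
EMBED_BINS = 32
THRESHOLD = 0.95  # acceptance threshold (assumed)

def voice_embedding(wav):
    """Summarize a sample as a coarse, L2-normalized spectral profile."""
    spectrum = np.abs(np.fft.rfft(wav))
    bins = np.array_split(spectrum, EMBED_BINS)
    profile = np.array([b.mean() for b in bins])
    return profile / np.linalg.norm(profile)

def authenticate(sample, enrolled_print):
    """Accept the speaker if the cosine similarity between the sample's
    embedding and the stored voiceprint clears the threshold."""
    similarity = float(np.dot(voice_embedding(sample), enrolled_print))
    return similarity >= THRESHOLD, similarity

# Toy speakers: different pitches stand in for different vocal tracts.
t = np.arange(SR) / SR
alice_enroll = np.sin(2 * np.pi * 210 * t) + 0.3 * np.sin(2 * np.pi * 420 * t)
alice_later  = np.sin(2 * np.pi * 212 * t) + 0.3 * np.sin(2 * np.pi * 424 * t)
impostor     = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 600 * t)

enrolled = voice_embedding(alice_enroll)  # stored at enrollment time
print("genuine:", authenticate(alice_later, enrolled))
print("impostor:", authenticate(impostor, enrolled))
```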