StorEBook: How Goldilocks and the Three Bears is Driving Text-to-Speech Technology
What do a three-year-old girl and the familiar, simple tale of Goldilocks and the Three Bears have to do with text-to-speech (TTS) technology? According to Taniya Mishra, everything.
“Current TTS technology doesn’t do a good job of expressing the complex and often strong emotions in children’s stories,” Mishra, a researcher in AT&T Research’s Speech Algorithms and Engines department, says. “My three-year-old would walk away from anything that sounded like a computer reading a story. She’d just think it was boring. Kids are the toughest customers sometimes.”
Current TTS technology is perfectly suited for many tasks, such as reading menu choices over the phone or relaying directions on our cell phones. What it doesn’t do well, at least not yet, is add emphasis for dramatic or emotional effect.
A year and a half ago, Mishra was reading to her then two-year-old when she realized that enhancing a TTS system so that it could read even the simplest of children’s stories effectively would indeed be a challenging problem with the potential to advance the technology significantly. She calls her project StorEBook, and it’s now in an early prototype stage.
“I’ve had more than a few people call this a cute project. StorEBook’s outward manifestation is indeed very cute; after all, it does involve kids, bears and silly voices. But the technology underlying it is anything but that. For a system to automatically recognize characters, their personality traits and emotions from the text of the story, to predict which is the right voice for each character and, finally, to create and not simply play back synthetic speech in character-appropriate voices is a very challenging technological problem, and thus requires the use of very complex and sophisticated search and learning algorithms, which are not the least bit cute.”
The backbone of her project is AT&T’s Natural Voices™ TTS technology. Computers digitally slice hours of recorded speech into thousands upon thousands of short segments called phonemes (the simple sounds that make up all speech) and store them in a database, along with such information as the pitch, duration, and amplitude of each slice. And we’re talking about lots of data here. For example, Natural Voices stores thousands of recorded versions of a single type of “t” sound (and there are more than a few t’s).
The same phoneme (/t/) in two contexts requires two different sounds from the database.
When a user inputs text, the system breaks it up into a string of phonemes, itself a complex task. Next, the system uses highly efficient algorithms to search its database for the most natural sounding samples to join together smoothly with cutting-edge digital signal processing techniques and output to your speakers.
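The database search described above is commonly framed as unit selection: for each phoneme in the input, pick one recorded sample so that the whole chain sounds natural. The Python sketch below is only an illustration of that idea under simplifying assumptions, not the Natural Voices implementation. The `Unit` record, the pitch-only `target_cost` and `join_cost` functions, and the tiny `DATABASE` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phoneme: str
    pitch: float     # Hz, stored alongside the recorded slice
    duration: float  # seconds


def target_cost(unit, want_pitch):
    # Hypothetical measure: distance between a stored slice and the pitch we want.
    return abs(unit.pitch - want_pitch)


def join_cost(left, right):
    # Hypothetical measure: pitch jump at the seam between two adjacent slices.
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Dynamic-programming (Viterbi) search for the cheapest chain of units.

    targets is a list of (phoneme, desired pitch) pairs.
    """
    if not targets:
        return []
    columns = []  # candidate units for each target position
    scores = []   # scores[i][j] = (cheapest path cost ending at columns[i][j], backpointer)
    for want_phoneme, want_pitch in targets:
        candidates = [u for u in database if u.phoneme == want_phoneme]
        if not candidates:
            raise ValueError(f"no recorded unit for phoneme {want_phoneme!r}")
        row = []
        for unit in candidates:
            cost = target_cost(unit, want_pitch)
            if not columns:
                row.append((cost, None))
                continue
            prev_units, prev_row = columns[-1], scores[-1]
            totals = [prev_row[k][0] + join_cost(prev_units[k], unit)
                      for k in range(len(prev_units))]
            k_best = min(range(len(totals)), key=totals.__getitem__)
            row.append((totals[k_best] + cost, k_best))
        columns.append(candidates)
        scores.append(row)
    # Backtrack from the cheapest final candidate.
    last = scores[-1]
    j = min(range(len(last)), key=lambda idx: last[idx][0])
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(columns[i][j])
        j = scores[i][j][1]
    return chosen[::-1]


# Toy database with two recordings of /t/ made in different contexts.
DATABASE = [
    Unit("k", 100.0, 0.05),
    Unit("ae", 120.0, 0.10),
    Unit("ae", 200.0, 0.10),
    Unit("t", 118.0, 0.04),
    Unit("t", 210.0, 0.04),
]

low = select_units([("k", 100.0), ("ae", 120.0), ("t", 120.0)], DATABASE)
high = select_units([("ae", 200.0), ("t", 200.0)], DATABASE)
```

In this toy setup the low-pitched request selects the 118 Hz /t/ and the high-pitched request selects the 210 Hz /t/, which is one way the same phoneme can end up needing two different recordings. A production system would weigh many more features than pitch.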
Mishra believes that something TTS researchers call prominence will play an important role in StorEBook. Prominence is the relative emphasis people place on different syllables and words as they speak. The level of prominence affects the amplitude (volume), pitch, or speed.
Like fingerprints, prominence patterns are unique to each of us. We all say even the simplest of phrases, like “I like blueberry pancakes,” in different ways.
Mishra hopes to take this important concept one step further with StorEBook because, she points out, “a big bad wolf may perhaps use different prominence patterns than a cute talking tea cup.” Different emotional states of the characters would also necessarily require different prominence patterns. For example, imagine a bored office worker commenting to his friends in the cafeteria: “I hate spinach.” Now imagine how a restless five-year-old might say the phrase at dinner: “I! HATE! SPINACH!”
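To make the spinach example concrete, the Python sketch below maps a per-word prominence level onto the three properties named earlier: pitch, amplitude, and speed. It is a toy illustration only. The function `render_prosody`, the 0.0 to 1.0 prominence scale, and every scaling constant are assumptions chosen for readability, not values from StorEBook.

```python
def render_prosody(words, prominence, base_pitch_hz=200.0):
    """Turn per-word prominence levels (0.0 = flat, 1.0 = strongest)
    into illustrative pitch, loudness, and duration targets."""
    if len(words) != len(prominence):
        raise ValueError("need exactly one prominence level per word")
    targets = []
    for word, level in zip(words, prominence):
        targets.append({
            "word": word,
            "pitch_hz": base_pitch_hz * (1.0 + 0.5 * level),  # more prominent: higher pitch
            "gain_db": 6.0 * level,                           # more prominent: louder
            "duration_scale": 1.0 + 0.5 * level,              # more prominent: held longer
        })
    return targets


words = ["I", "hate", "spinach"]
# Same text, two hypothetical prominence patterns.
bored_worker = render_prosody(words, [0.0, 0.25, 0.0])
restless_child = render_prosody(words, [1.0, 1.0, 1.0])
```

The point of the sketch is that the text stays fixed while the pattern changes: the office worker gets a mild bump on "hate", and the five-year-old gets maximum emphasis on every word.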
AT&T Research’s Summer Intern Program is a vital part of life for busy researchers, including Mishra. Recently, she collaborated with Mahnoosh Mehrabani, a summer intern and PhD candidate from the University of Texas at Dallas, on figuring out a way to make Natural Voices more expressive.
TTS systems normally use long-standing linguistic rules to predict prominence. These rules, though generally correct, will not capture the quirks and mannerisms of a particular speaker, quirks and mannerisms that give each one of our voices a distinct expressive quality. For this, human “labelers” are used to help build models that predict prominence patterns for a particular voice (or character). Labelers listen to and judge hundreds of recorded examples in order to help build good models. Although this method is accurate and requires a relatively small amount of data, this “supervised” process takes time and money.
Because StorEBook could potentially require hundreds of highly expressive voices, building each model with human labelers would be prohibitively expensive. Prominence-prediction models for different characters therefore need to be built automatically, in an “unsupervised” fashion. Mishra and Mehrabani set out to build such a system.
By studying the speech characteristics of a previously recorded voice across a number of parameters, they discovered that they could predict prominence patterns automatically with results that closely match those produced manually. In other words, they derived their own rules about prominence based on a statistical analysis of the unique characteristics of a particular speaker’s recorded voice.
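The article does not say which parameters or statistics the researchers used, so the Python sketch below is a stand-in rather than their method. It shows one simple label-free approach: normalize each word's measurements against the same speaker's own averages, then flag words that stand out. The choice of pitch, energy, and duration as features, the function names, and the 0.5 threshold are all assumptions.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values against this speaker's own mean and spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def prominence_scores(words, features=("pitch", "energy", "duration")):
    """Average the per-feature z-scores of each word (no human labels needed)."""
    normalized = [zscores([w[f] for w in words]) for f in features]
    return [mean(per_word) for per_word in zip(*normalized)]


def label_prominent(words, threshold=0.5):
    """Mark a word as prominent when it stands out from the speaker's norm."""
    return [score > threshold for score in prominence_scores(words)]


# Hypothetical measurements for one speaker saying "I hate spinach".
recording = [
    {"text": "I", "pitch": 180.0, "energy": 60.0, "duration": 0.10},
    {"text": "hate", "pitch": 260.0, "energy": 72.0, "duration": 0.35},
    {"text": "spinach", "pitch": 190.0, "energy": 62.0, "duration": 0.30},
]
labels = label_prominent(recording)
```

Because every measurement is compared with the speaker's own average, a naturally loud or high-pitched voice is not mistaken for constant emphasis. That is the sense in which a statistics-driven model can adapt to an individual speaker.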
“A significant advantage of our prominence prediction system is that it adapts to the way a particular person speaks, so it mimics the expressiveness of that voice very closely. By way of contrast, rule-based systems use general linguistic rules and therefore cannot capture the quirks and mannerisms that give every voice its unique expressive quality,” says Mehrabani.
Subtle changes in prominence (see sidebar) are but a small step towards a fully realized StorEBook system, however. Mishra and Mehrabani’s system is quite good at analyzing context in terms of individual phrases, but Mishra believes that analyzing the context of entire paragraphs, and perhaps even whole stories, is the key to making her StorEBook project a reality.
StorEBook contains a database of metadata that help it to determine how to tell a given story. One example is knowing when each character or a narrator is speaking. Another is choosing an appropriate character voice based on the attributes of a given character. In its prototype form, researchers enter this information manually on a per-story basis.
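As a rough picture of what such per-story metadata might look like, the Python sketch below records who speaks each line and picks a voice from a character's attributes. The structure is hypothetical: the `Character` and `Line` records, the attribute names, and the voice inventory are invented for illustration and do not describe StorEBook's actual database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    name: str
    attributes: frozenset  # e.g. {"adult", "male", "bear"}


@dataclass(frozen=True)
class Line:
    speaker: str  # a character name, or "narrator"
    text: str


# Hypothetical voice inventory: voice name -> attributes it suits.
VOICES = {
    "deep_gruff": frozenset({"adult", "male"}),
    "warm_soft": frozenset({"adult", "female"}),
    "small_squeaky": frozenset({"child"}),
}
NARRATOR_VOICE = "narrator_neutral"


def choose_voice(character, voices=VOICES, default=NARRATOR_VOICE):
    """Pick the voice that shares the most attributes with the character."""
    best = max(voices, key=lambda name: len(voices[name] & character.attributes))
    return best if voices[best] & character.attributes else default


def voice_plan(lines, cast):
    """Pair every line of the story with the voice that should read it."""
    plan = []
    for line in lines:
        if line.speaker == "narrator":
            voice = NARRATOR_VOICE
        else:
            voice = choose_voice(cast[line.speaker])
        plan.append((voice, line.text))
    return plan


CAST = {
    "Papa Bear": Character("Papa Bear", frozenset({"adult", "male", "bear"})),
    "Mama Bear": Character("Mama Bear", frozenset({"adult", "female", "bear"})),
}
STORY = [
    Line("narrator", "The bears return to their house."),
    Line("Papa Bear", "Why, someone has been tasting my porridge."),
    Line("Mama Bear", "What? Let me see."),
]
plan = voice_plan(STORY, CAST)
```

In the prototype described here, the equivalent of `CAST` and the speaker tags in `STORY` would be entered by hand for each story.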
But as the number of stories grows, so too will the similarities between them. Mishra envisions a system that learns to recognize these similarities in order to make predictions about metadata based on information it has about other stories. For example, the kinds of words bad guys use or the fact that younger characters use smaller, simpler vocabularies might be clues about the types of characters that are in the story. With such information, StorEBook could, all on its own, choose appropriate pre-recorded generic character voices, recorded by professional actors . . . or even grandma?
Mishra and her colleagues are already at work on that, figuring out how to convert a couple hundred sentences recorded by anyone, mom, dad, and grandma included, into generic voices for use as characters. This is an aspect of StorEBook that Mishra finds particularly interesting:
“Giving the kids the ability to select their parents’ or grandparents’ voice as one of the character voices is both useful and fun. It is useful in that research has shown that when kids are read to in a familiar voice, they learn and retain the material better. It is also fun because kids get to have their parents (especially grandparents, who may live far away) ‘reading’ to them, even when they are physically not there. Another interesting aspect of this is voice banking, which is creating a synthetic version of a person’s voice in case they lose their voice. Voice banking is pretty cool because it means that in a way my voice could live on, reading and creating new words for future generations, even when I am not there!”
In its prototype form, a young reader taps a sentence on an iPad, and StorEBook highlights each word as it’s read, switching character voices along the way. [Video: Mishra demonstrates the technology.]
Talking educational toys have been around since the 1970s. Anyone remember Speak and Spell? Or the talking 2-XL robot? StorEBook is different. First, with so much artificial intelligence planned for the project, it’s hard to argue that StorEBook is a toy at all. Second, unlike other readers, StorEBook will run on practically any device with access to the web because it uses HTML5 technology, and users will eventually be able to input any story at all for dramatic interpretation.
Certainly Mishra believes the future holds much more profound uses for her StorEBook technology:
“Five years from now, I hope that StorEBook will be a completely normal technology that kids and their parents use easily and often for reading books—including self-authored ones—using a multiplicity of fun and familiar voices, both for learning and for fun. My further hope is that StorEBook could evolve to be a useful tool that could be used by children with learning disorders.”
Listen for yourself
The following audio samples, taken from StorEBook's version of Goldilocks and the Three Bears, demonstrate the increased use of prominence to give expression to computer-generated speech.
Narrator voice: The bears return to their house.
Papa Bear: Why, someone has been tasting my porridge.
Mama Bear: What? Let me see. Look! Someone has left a spoon in my porridge too.
About the researchers
Taniya Mishra is a Senior Member of Technical Staff at AT&T Research in the Speech Algorithms and Engines Research Department. Mishra received her PhD in Computer Science from the OGI School of Science & Engineering at OHSU in 2008. Her thesis introduced a new algorithm for intonation analysis and applied it to text-to-speech synthesis. At AT&T Research, Mishra works on speech synthesis, voice-enabled search, prosody modeling, voice signatures and their application to speaker recognition, speech synthesis, and other speech technologies.
Mahnoosh Mehrabani received her master’s degree in Electrical Engineering from the University of Texas at Dallas in 2009, and is currently a PhD candidate. She has been a Research Assistant with the Center for Robust Speech Systems at the University of Texas at Dallas since 2007. Her broad research interests include speaker clustering and subspace modeling, speaking styles, prosodic features, speech synthesis, speech analysis, and language and accent classification.