Easy Remote: Television’s First Killer App
If you’re one of AT&T’s four million U-verse® customers who owns an iPhone or iPad, you’ll soon be able to ask it, “What’s on TV?” thanks to AT&T Research innovators Harry Chang, Charles Galles, and Michael Johnston, who’ve created the world’s first voice-search feature for television, now found in AT&T’s iPhone app, Easy Remote.
We’ve been asking, “What’s on TV?” for about seventy years now, ever since television stormed America after World War II. Despite the advanced technologies that define the modern viewing experience, like high definition, flat screens, and IPTV, the way we watch television hasn’t changed much. Even our remote controls are not drastically different from those first produced in the early 1950s.
The problem now is that “What’s on TV?” is a far more complicated question than it was in 1950, or even 1990.
Modern cable and IPTV television services like AT&T’s U-verse® offer consumers hundreds of channels—and that number will one day top 1,000. In addition to regular programming, there are thousands of on-demand movies and shows. Managing all this choice is a complex problem.
Over the past decade, conventional residential television service providers, as well as a new generation of providers, have looked at the hard technical problem of making voice search of television content as easy as picking up a TV remote control—the most widely used consumer device at home.
-Harry Chang, Lead Member of Technical Staff at AT&T Research
Many of us resort to sticking to a small subset of favorite channels, or perhaps we only watch a select few shows. Yes, we understand that we’re missing out on lots of great content that we’re paying for, but how do we find it?
Current set-top box search mechanisms are cumbersome and ineffectual. Many people find it difficult to spell out the names of shows, not to mention more natural searches like “movies with Arnold Schwarzenegger,” one letter at a time using a common TV remote.
“Consumers want their experience with television to be straightforward and relaxing, rather than getting bogged down with complex menus and onscreen keyboards. Easy Remote lets you just say what you want, sit back and watch it,” says Michael Johnston, Principal Member of Technical Staff at AT&T Research.
Like all great apps, Easy Remote does something that is conceptually simple. Behind the scenes, however, providing a high-precision search result for a simple search term like “New York Giants” is no easy task.
According to Johnston, “Voice search for television poses different challenges from other speech recognition tasks. The underlying data is highly dynamic and recognition models need to be constantly adapted to maintain performance.”
When you utter a phrase such as “ABC” into Easy Remote, your voice is first digitized and routed wirelessly through your U-verse® Wireless Gateway and secured home network. From there, your query travels over the superfast AT&T broadband network to the cloud.
Easy Remote relies on something called the AT&T Speech API, a cloud-friendly version of AT&T WATSON℠, AT&T’s speech recognition technology. AT&T Speech API, announced in June 2012, is designed to handle billions of voice requests from all manner of devices such as smartphones, tablets, PCs, smart cars, or a connected house—basically anything with a microphone. Creating an electronic programming guide (EPG) voice-search engine on top of this technology presented an enormous technical challenge for researchers Chang, Galles, and Johnston.
According to Chang, “Voice-searching in the television domain demands a rich, dynamic vocabulary of search terms. An EPG contains thousands upon thousands of titles, directors, actors, descriptions, nicknames and initials, substitute names like Star Wars A New Hope vs. Star Wars IV, and even common mispronunciations. This results in a vocabulary of hundreds of thousands of words.”
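To make Chang’s point concrete, here is a hypothetical sketch of how a search vocabulary might be harvested from EPG records. The record format, field names, and sample guide data are illustrative assumptions, not AT&T’s actual data model.

```python
import re

def epg_vocabulary(epg_entries):
    """Collect a search vocabulary from EPG entries (a hypothetical,
    simplified record format: dicts with title, cast, and aliases)."""
    vocab = set()
    for entry in epg_entries:
        # Titles, cast names, and alternate titles all become search terms.
        terms = [entry["title"]] + entry.get("cast", []) + entry.get("aliases", [])
        for term in terms:
            # Normalize: lowercase, then split into individual words.
            for word in re.findall(r"[\w']+", term.lower()):
                vocab.add(word)
    return vocab

guide = [
    {"title": "Star Wars: Episode IV - A New Hope",
     "cast": ["Mark Hamill", "Harrison Ford"],
     "aliases": ["Star Wars", "A New Hope"]},
]
print(sorted(epg_vocabulary(guide)))
# ['a', 'episode', 'ford', 'hamill', 'harrison', 'hope', 'iv',
#  'mark', 'new', 'star', 'wars']
```

Even this single movie contributes eleven vocabulary words; multiplied across thousands of daily EPG entries, the hundreds-of-thousands-word vocabulary Chang describes follows quickly.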
Consider, for example, all that must happen to present meaningful results for a simple query like “ABC.” First, AT&T WATSON℠ must recognize “ABC” as a spoken three-letter initialism rather than an ordinary word. Then the search engine has to determine where the app is being used geographically, so it knows which program guide to search. Next, it must determine the true intention of the search, in this case, to find programs on a local station affiliated with the TV network ABC. Finally, it must also expand the meaning of “ABC” and consider other channels that are related, like ABC News Now.
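Those interpretation steps could be sketched roughly as follows. The affiliate tables, market names, and related-channel mappings here are invented for illustration and are not AT&T’s actual data.

```python
# Illustrative channel data; real affiliate mappings differ by provider.
AFFILIATES = {"abc": {"new york": "WABC 7", "chicago": "WLS 7"}}
RELATED = {"abc": ["ABC News Now"]}

def interpret(query, market):
    """Resolve a short spoken query against the viewer's local guide."""
    q = query.strip().lower()
    results = []
    # Steps 1-2: treat a known short all-letter query as a channel
    # initialism and resolve it against the viewer's local market.
    if q in AFFILIATES:
        local = AFFILIATES[q].get(market)
        if local:
            results.append(local)
        # Steps 3-4: expand to related channels that share the brand.
        results.extend(RELATED.get(q, []))
    return results

print(interpret("ABC", "new york"))  # ['WABC 7', 'ABC News Now']
```

The same spoken query thus returns different stations in different markets, which is why the geographic lookup has to happen before intent resolution.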
Worse, all of this changes daily and therefore requires updating complex speech-recognition models. Normally, building speech models requires time, testing, and a lot of sample data in order to train the system to anticipate which words and phrases are most likely used in the real world in a given context. Somehow, the system would have to create new speech models daily, based on existing data.
“Each day you have to successfully process data streams related to thousands of national and local programming events. Your processes have to be robust to unusual program names, singular events, and new hit shows. It leads to a lot of moving parts, and all of these moving parts have to work well to keep the recognition models fresh and ready for each day's programming lineups,” says Charles Galles, Principal Member of Technical Staff at AT&T Research.
The key to making such a system work would come from voice-searchers themselves. For the first time since the invention of the modern TV remote, researchers had to answer a fundamental question: just what would people ask a voice remote if they had one? Chang and Johnston, with help from researchers Bernie Renger and Iker Arizmendi, set out to find out in 2006.
“We started with a series of controlled studies in collaboration with AT&T’s Human Factors organization. These studies revealed that over 90% of all possible search terms in spoken form can be automatically discovered from the EPGs delivered to AT&T daily, covering all sixty-six U-verse® markets. The trick is keeping a long history of data, between two and four years’ worth,” says Chang.
The researchers found that viewers tend to like the same programs over time, even when the programs are rescheduled. Similarly, viewers tend to like the same actors over time. Through a number of data-driven and content-analysis techniques, their TV Rating Engine automatically determines the most often searched shows and actors, allowing them to build speech models more efficiently.
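One plausible data-driven weighting consistent with that finding is frequency scoring with a slow recency decay, so long-running favorites stay highly ranked even when rescheduled. This is an illustrative simplification; the internals of the actual TV Rating Engine are not described here.

```python
from collections import Counter

def rank_terms(search_history, half_life_days=365):
    """Score each searched term by frequency with exponential decay.
    search_history is a list of (term, age_in_days) pairs; a slow
    half-life keeps enduring favorites near the top. (Illustrative
    scoring only, not AT&T's actual rating formula.)"""
    scores = Counter()
    for term, age_days in search_history:
        scores[term] += 0.5 ** (age_days / half_life_days)
    return [term for term, _ in scores.most_common()]

history = [("news", 1), ("news", 30), ("jeopardy", 2), ("news", 400)]
print(rank_terms(history))  # ['news', 'jeopardy']
```

Repeated searches spread over a long window outweigh a single recent search, which matches the observation that viewers’ tastes are stable over time.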
Chang, Galles, and Johnston worked on building their EPG Data Miner and TV Rating Engine for two years before coming up with an implementation plan in 2010, which allowed the team in Research to build the brains behind their voice-search feature.
So now, when you say “news” into Easy Remote, their highly complex system is able to return a list of candidate shows and movies back to the Easy Remote app in mere seconds.
To turn six years of research and prototypes into a product ready for the App Store, the team at AT&T Research relied on the AT&T Foundry, which helps drive innovation and bring ideas to market faster. “We built the first working prototype of Easy Remote’s voice-search feature at AT&T’s Foundry workshop a year ago, so that’s about a year from prototype to App Store,” Chang says.
So what is on TV? Thanks to countless hours over six years of hard work by the Easy Remote team at AT&T Research and the power of AT&T WATSON℠ technology, you can now simply ask your remote.
Building the Easy Remote language model
All voice-enabled apps rely on a language model to recognize spoken words.
Creating a language model requires collecting and transcribing millions of sample utterances, then statistically analyzing them to learn which words and phrases are most likely to be spoken. It can take months or years to refine a model for production environments.
That approach can’t work for TV programming, which changes daily and where the most likely phrases—the titles of new shows—aren’t yet included in the model because they haven’t yet been spoken.
For Easy Remote, researchers created a dynamic hierarchical model that combines data from different sources: transcribed data from existing statistical and rule-based models to bootstrap the system, text from electronic program guides to do daily updates, and popularity information scraped from external websites to properly weight the titles most likely to be searched.
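A standard way to combine models from several sources is linear interpolation of their word probabilities; the sketch below uses that textbook technique with made-up unigram probabilities and weights, as a stand-in for the researchers’ actual hierarchical model.

```python
def interpolated_prob(word, sources, weights):
    """Mix unigram probabilities from several sources by linear
    interpolation -- one common way to blend a stable base model with
    daily EPG text and popularity data. (Weights are illustrative.)"""
    return sum(w * src.get(word, 0.0) for src, w in zip(sources, weights))

base  = {"news": 0.02, "movie": 0.03}       # bootstrapped base model
daily = {"news": 0.05, "newshour": 0.01}    # today's EPG titles
pop   = {"news": 0.10}                      # popularity-weighted terms

p = interpolated_prob("news", [base, daily, pop], [0.5, 0.3, 0.2])
print(round(p, 3))  # 0.045
```

Because the daily and popularity sources carry their own weights, a brand-new show title gets nonzero probability the day it first appears in the guide, before anyone has ever spoken it.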
Every night the Easy Remote language model is updated, put through a quick self-test, and immediately put into production.
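A nightly self-test of this kind can be as simple as replaying a fixed set of known utterance/result pairs against the new model before it goes live. The probe pairs and the toy recognizer below are hypothetical, shown only to illustrate the gating idea.

```python
def passes_self_test(recognize, probes):
    """Return True only if the new model reproduces every expected
    result for a fixed set of probe utterances. (Illustrative gate;
    the probes and recognizer here are stand-ins.)"""
    return all(recognize(audio) == expected for audio, expected in probes)

# Toy recognizer standing in for the freshly built model.
model = {"abc.wav": "ABC", "news.wav": "news"}
probes = [("abc.wav", "ABC"), ("news.wav", "news")]
print(passes_self_test(model.get, probes))  # True
```

If the gate fails, the previous night’s model can simply stay in production, so a bad data feed never reaches viewers.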