Tech View: Technology for Making TV Viewing Easy

by: Harry Chang, Bernard Renger, and Michael Johnston, Mon Oct 12 10:07:00 EDT 2009

TV viewing has become harder. With massive amounts of available content—some homes have access to 400+ channels, a number that may rise soon to 1,000—and accessory devices such as set-top boxes, DVRs, converter boxes, TiVo®, and VCRs, watching and recording TV requires some effort. Programs of true interest and their times have to be hunted down amid all the other programs, buttons on multiple remotes have to be identified, and the interfaces of the DVR and other devices must be mastered (some more intuitive than others).

Cutting through this clutter to make TV viewing easy again is becoming increasingly important. Solutions include universal remotes, Windows®-type interfaces, and even hand gestures, but they themselves require some learning and may be just a different way to manage the clutter.

One solution, however, requires little or no learning, is simple, natural, and intuitive, and actually does away with clutter: speech. Instead of scrolling through lists, navigating a menu, or tapping out instructions, a speech interface lets you say what you want to watch. Speaking "The Real Housewives of New Jersey" or "Knicks game" into a microphone-based remote would return a short, selectable list of programs fitting the description, giving times and days.


Technologies exist

How long before speech-controlled TV is a reality? Maybe not too long. The technologies are pretty much in place.

  • Automatic speech recognition (ASR), the ability of computers to understand naturally spoken language, is widely used in call center applications where speech recognition software can recognize whether someone is calling to pay a bill or apply for a mortgage. The same software could be adapted for TV programming by modifying the language models that contain the vocabulary to be recognized.
  • Remote controls equipped with a microphone to relay speech exist now using the common RF (radio frequency) technology found in cordless phones. Future remotes will incorporate Wi-Fi to enable communication with other Wi-Fi devices while streaming audio of spoken commands.

What's missing to make speech-controlled viewing possible are the computing resources needed to perform speech recognition on the thousands of TV shows and videos available to consumers. A speech recognition application works by matching spoken words to words (really sounds of words) contained in a language model that has been carefully constructed for a specific application and context, such as a banking call center. For speech-controlled viewing, the language model would contain everything in an electronic program guide. Not only is this a huge collection of words to recognize, it's one that would have to be updated at least nightly to keep it current with programming changes.
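
As a rough illustration of why the guide-driven vocabulary is large and changes often, the sketch below collects the distinct words from a handful of guide entries and computes what would have to be added or retired after a guide update. The entry layout (`title`, `cast`), the sample entries, and both function names are assumptions made for this example; a production language model would hold pronunciations and word-sequence statistics, not just a word list.

```python
import re

# Hypothetical guide entries; the field names and the "Example Movie" row are made up.
SAMPLE_GUIDE = [
    {"title": "The Real Housewives of New Jersey", "cast": []},
    {"title": "Knicks Game", "cast": []},
    {"title": "Example Movie", "cast": ["Kiefer Sutherland"]},
]


def build_vocabulary(guide):
    """Collect the distinct lowercase words found in titles and cast names."""
    words = set()
    for entry in guide:
        for text in [entry["title"], *entry["cast"]]:
            words.update(re.findall(r"[a-z0-9']+", text.lower()))
    return words


def vocabulary_delta(old_words, new_words):
    """Return (words to add, words to retire) after a guide update."""
    return new_words - old_words, old_words - new_words
```

Running `build_vocabulary` on last night's guide and on tonight's, then passing both results to `vocabulary_delta`, gives a feel for how much of the model would need refreshing each night.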

While some devices have onboard speech recognition (the iPhone, for example), device-based speech recognition would be impractical for a model as large as the one required by an electronic program guide. Neither the TV nor set-top box (or remote) has sufficient computing resources for performing speech recognition on all available content. While there are hardware solutions for adding PC-type resources to these devices, new hardware increases the costs to the consumer.

In addition, there is the very practical problem of having to download to each device a new language model as programming changes are made. Downloading might occur as often as every night. Software changes would also have to be periodically downloaded.

The preferred solution is to do all the speech processing on networked servers, and connect the TV set-top box to the same network using a residential, or home, gateway. Commands spoken into a remote would be relayed from the gateway over the network to a central location where servers would perform the necessary speech recognition.

This scheme is similar to using an iPhone or other smartphone to request a listing from a business search such as YELLOWPAGES.COM®. The spoken request is relayed via the cell network to speech recognition servers, where the spoken request is converted to commands used for the database lookup.


With speech recognition done at a central location, updates to both software and the language models can be done easily and as frequently as needed. There is an added benefit. The spoken commands used by people in the home represent hard-to-get, real-world sample data that can be used in training the models to improve performance. If speech recognition were done on the device, this valuable resource would be more difficult to exploit for refining the models.
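
The server-side arrangement described in this section can be sketched in a few lines. This is only a toy model under stated assumptions: the class names, the `device_id` field, and the lookup table that stands in for a recognizer are all invented for illustration, and no real speech recognition or networking takes place. It shows the two points made above: the gateway merely relays audio, and the server keeps every utterance it receives as potential training data.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    device_id: str  # which set-top box sent the request (hypothetical field)
    audio: bytes    # audio captured by the remote's microphone


class RecognitionServer:
    """Stand-in for the central servers; a lookup table replaces real ASR."""

    def __init__(self, transcripts):
        self._transcripts = transcripts  # maps audio payloads to text
        self.received = []               # utterances kept as sample data

    def recognize(self, utterance):
        self.received.append(utterance)
        return self._transcripts.get(utterance.audio, "")


class Gateway:
    """Home gateway: forwards audio from the remote to the server."""

    def __init__(self, device_id, server):
        self.device_id = device_id
        self.server = server

    def relay(self, audio):
        return self.server.recognize(Utterance(self.device_id, audio))
```

Because the recognizer and its models live only in `RecognitionServer`, replacing them requires no change to the gateway or the remote, which is the maintenance advantage of the centralized design.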


Benefits of a connected TV

If server resources become available to at-home TV viewers, a lot more can happen to make TV viewing easy again.

With servers performing program lookup, searches can become more complex and involve metadata such as program type, actor, or any combination of search criteria. Thus you could search not just by title but also by genre or actor. Speaking "Late Night Comedy Show," "Basketball games on Sunday night," or "Movies with Kiefer Sutherland" would get back a list of programs fitting the description.
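
A minimal sketch of such a combined lookup follows, assuming the recognized request has already been broken into criteria. The `Program` fields, the sample listings (including the made-up "Example Thriller"), and the `search` function are hypothetical and chosen only to mirror the examples in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Program:
    title: str
    genre: str
    day: str
    actors: list = field(default_factory=list)


# Hypothetical listings for illustration only.
SAMPLE_LISTINGS = [
    Program("Example Thriller", "movie", "Friday", ["Kiefer Sutherland"]),
    Program("Knicks Game", "basketball", "Sunday"),
    Program("Late Night Comedy Show", "comedy", "Monday"),
]


def search(programs, title=None, genre=None, actor=None, day=None):
    """Return the programs that match every criterion that was supplied."""
    results = []
    for program in programs:
        if title and title.lower() not in program.title.lower():
            continue
        if genre and genre.lower() != program.genre.lower():
            continue
        if day and day.lower() != program.day.lower():
            continue
        if actor and actor.lower() not in (a.lower() for a in program.actors):
            continue
        results.append(program)
    return results
```

For instance, `search(SAMPLE_LISTINGS, genre="basketball", day="Sunday")` corresponds to the spoken request "Basketball games on Sunday night," and `search(SAMPLE_LISTINGS, genre="movie", actor="Kiefer Sutherland")` to "Movies with Kiefer Sutherland."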

But why stop there? With computing resources available, why not put the server resources to work to suggest programs or videos you might like? You could just say "What's on tonight?" and have the TV return a list of suggestions tailored to your preferences. If Netflix® and book vendors can make reasonable guesses as to what you'd like, TV providers can run similar algorithms on TV programming data stored on the centralized servers.

The servers could also program the DVR for you. If you want to record a program, just say what you want to record and once the speech recognition is performed, the instructions could be sent directly to the DVR. No need for you to be involved other than to say what you want.
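
One way to picture this hand-off is a small function that turns recognized text into an instruction for the DVR. The sketch assumes a "record ..." phrasing, a listing layout with `title`, `channel`, and `start` fields, and an instruction format that are all invented here; an actual service would define its own.

```python
# Hypothetical listing; the channel number and start time are made up.
SAMPLE_SCHEDULE = [
    {"title": "Knicks Game", "channel": 7, "start": "Sunday 19:30"},
]


def build_dvr_instruction(recognized_text, schedule):
    """Turn recognized text such as 'record knicks game' into a DVR instruction.

    Returns None when the text is not a record request or nothing matches.
    """
    text = recognized_text.lower().strip()
    if not text.startswith("record "):
        return None
    wanted = text[len("record "):]
    for listing in schedule:
        if wanted in listing["title"].lower():
            return {
                "action": "record",
                "channel": listing["channel"],
                "start": listing["start"],
            }
    return None
```

The returned dictionary is what the server would send on to the DVR, so the viewer never touches the recorder's own menus.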


The potential of IPTV

Connecting TV devices to a content provider's servers over a network is the IPTV model, a relatively new paradigm in which a private provider (usually the telephone company) distributes content over an existing broadband infrastructure by first encoding it and relaying it as a series of IP (Internet Protocol) packets over a broadband network, rather than over traditional cable or the airwaves.

(Although IP is the same protocol used to relay video over the Internet, IPTV is not TV over the Internet; it's not the same as watching YouTube clips on a PC. Instead, IPTV is high-quality, high-resolution video that's delivered over a broadband connection, which can also deliver Internet content such as web pages, YouTube video, email, etc.)

An example of an IPTV service is AT&T's U-verse℠, where a home gateway connects the TV set-top box to AT&T's broadband network, enabling the set-top box to communicate with servers running AT&T's WATSON ASR (and other speech technologies) as well as with other devices on the network. The TV set-top box is essentially an endpoint on the network, just the same as a PC, laptop, or iPhone.

To stream speech to the network servers, it is advantageous to exploit the Wi-Fi feature of the U-verse gateway. With its Wi-Fi capability, the gateway could also serve as an access point for any Wi-Fi-enabled devices (TV remote, laptop, iPhone) to communicate with other endpoints on the IP-based home network, including any home PCs. Thus computer files, including emails and photos, could be viewable on the TV, and any Wi-Fi enabled device (such as the iPhone) could control the set-top box and DVR.

Figure (IPTV diagram): In AT&T's U-verse, a residential gateway enables the TV set-top box to be an endpoint on AT&T's broadband network and request programming and services from AT&T.


If the TV's set-top box is a node on a network, communication between the home and the provider becomes two-way, with commands going out and programs and device instructions coming in. While the full ramifications of two-way communication are not yet known, it is certain that interactivity will be a major benefit of IPTV and will include much more than just shopping or participating in game shows from home.


Remaining hurdles

Speech-controlled viewing is not quite ready due to several factors. Some are long-standing ASR problems, such as the constant puzzle of predicting what people may say and the difficulty in recognizing uncommon accents or the high-pitched voices of children.

In addition, speech-controlled viewing brings its own set of hard problems.

  • The vocabulary for the electronic program guide is not only huge (100K+ words), since it must encompass the names of thousands of titles, directors, and actors (some of whom may appear in only a single movie), but also poses unique challenges. Unlike other large-vocabulary applications such as 411 Directory Assistance Service, the program guide vocabulary would need to accommodate many substitutions, since some actors may be known by initials or popular nicknames. Some names, especially foreign names, may be pronounced incorrectly.
  • Background noises could interfere with spoken commands to the TV. Living rooms can be noisy places, and more work is needed to improve the remote to distinguish and filter out background noises; the TV itself is often the cause of background noise, so the problem is not easily solved. Since creating a highly directional microphone may add to the cost, filtering algorithms may be developed to prevent background noise from interfering with accurate speech recognition.
  • Constant, perhaps daily, updates to the speech recognition models would be needed to accommodate new programming. Normally, updating and training language models requires time and testing but, most of all, sample data (utterances) to help anticipate exactly what words and word combinations will be used in the real world. Since new programs and new videos are always being added, there needs to be a way to roll out new language models and train them from available data.
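
The substitution problem in the first item above (initials and nicknames standing in for full names) can be illustrated with a simple alias table that maps every spoken form back to one canonical name. The table contents and the function name are illustrative assumptions, not real usage data; in practice such aliases would have to be curated or learned from collected utterances.

```python
# Hypothetical alias table; the alternate forms are invented for illustration.
SAMPLE_ALIASES = {
    "kiefer sutherland": ["kiefer", "k sutherland"],
}


def expand_with_aliases(names, aliases):
    """Map each spoken form (canonical name or alias) to its canonical name."""
    spoken_to_canonical = {}
    for name in names:
        canonical = name.lower()
        spoken_to_canonical[canonical] = canonical
        for alias in aliases.get(canonical, []):
            spoken_to_canonical[alias.lower()] = canonical
    return spoken_to_canonical
```

Every alias adds another entry the recognizer must handle, which is one reason the program-guide vocabulary grows well beyond the raw count of names.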

But these are problems for the engineers.

What would consumers have to do to make TV-viewing easy again?

Not much, other than subscribing to an IPTV service that offers speech recognition and a voice remote. For many people, having IPTV means replacing a cable service with an IPTV one. (IPTV service will normally be offered as part of a triple- or quadruple-play package that includes phone (wireline and wireless), Internet, and TV.)

From then on everything else is easy since the provider maintains the servers and software and takes on the responsibility for programming the DVR. All consumers need to do is say what they want to watch using their own words.



Tech View: Views on Technology, Science and Mathematics

Sponsored by AT&T Labs Research

This series presents articles on technology, science and mathematics, and their impact on society -- written by AT&T Labs scientists and engineers.
