iMIRACLE - Content Retrieval on Mobile Devices with Speech

What is iMIRACLE?

iMIRACLE uses large vocabulary speech recognition to recognize metadata words (titles, genres, channels, etc.) as well as content words that occur in recorded programs. It uses AT&T WATSON to recognize spoken queries and natural language understanding to parse the recognition results into the appropriate search query for the video search engine. It uses the content-based video search engine of the MIRACLE platform to find recorded programs that meet the search criteria. Users can play back the video on the iPhone/iPad or on a television (TV playback requires either a Microsoft Vista PC or a set top box similar to those deployed with AT&T U-verse). As new TV programs are processed and indexed by MIRACLE, new speech models are created daily to handle the constantly changing metadata and vocabulary.

iMIRACLE extends the speech-based control premise of earlier projects that let you control your TV with a speech-enabled remote control. Related TV-based technologies are listed under Related Projects below.

Multimodal Input for iMIRACLE

iMIRACLE is easy to use. You can use speech, text entry, or graphical input. You can search metadata like title, genre, channel, and time, or you can search enhanced metadata, which lets you search for words within the content, or you can combine all of these in one search. The beauty of using natural language speech is that it is easy to combine constraints in a single query, and the natural language understanding capability effectively parses the different components of the query. Just press the button labeled "Speak", speak your query in natural language, and press the button again (now labeled "Stop") when you are done speaking. The text entry field next to the button is updated automatically. iMIRACLE detects when you have stopped speaking, so pressing the "Stop" button is optional. If you are in a noisy environment, or one where speech input does not make sense, you can simply type in the natural language query. You can even make typos, and the system will attempt to correct the spelling by returning a result that sounds most like what you typed. For example, typing "Prezident Obamma" will return "president obama".
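Typo-tolerant text entry of this kind relies on sound-alike matching between what was typed and the indexed vocabulary. The exact method iMIRACLE uses is not described here; the sketch below uses a simple Soundex key purely to illustrate the idea of matching by sound rather than by spelling.

# Minimal sketch of sound-alike matching for typo-tolerant queries.
# Soundex is used only as an illustration, not as iMIRACLE's actual method.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(word: str) -> str:
    """Return the 4-character Soundex key for a word."""
    word = word.lower()
    key = word[0].upper()
    codes = []
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            codes.append(code)
        if ch not in "hw":  # 'h' and 'w' do not reset the previous code
            prev = code
    return (key + "".join(codes) + "000")[:4]

def sounds_like(typed: str, vocabulary: list) -> list:
    """Return vocabulary words whose Soundex key matches a typed word."""
    keys = {soundex(w) for w in typed.split()}
    return [v for v in vocabulary if soundex(v) in keys]

vocabulary = ["president", "obama", "comedy"]
print(sounds_like("Prezident Obamma", vocabulary))  # -> ['president', 'obama']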

Interface Examples

1. Below is an example interface that shows the results of the speech search query "comedy shows on n b c past week that mention president obama". This text is the recognized text output by the speech recognizer. The user can navigate the list of shows that meet these criteria using the graphical interface. If the user navigates to the bottom of the page (by swiping the screen), NEXT and PREVIOUS buttons allow the next 10 results to be retrieved.

iMIRACLE Program Listing

2. If the user selects one of the shows from the listing, the interface shown below is displayed. Context text and representative thumbnails are shown. Just tap on a thumbnail and the video will start playing on the iPhone or iPad.

iMIRACLE Program Selection

3. The user can also direct the video playback to a TV connected to a properly configured Media Center PC or a U-verse set top box. The video then plays on the TV when a thumbnail is selected. A link for TV controls appears, and the user can play, stop, pause, mute, increase or decrease the volume, and toggle between the full-screen and reduced-screen views (a rough sketch of this control set is given below). If the query contained content words, then playback of the video, on either the iPhone/iPad or the TV, starts from that point.

iMIRACLE Program Selection: TV output
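The TV controls described in example 3 amount to a small command set sent from the handheld client to the Media Center PC or set top box. The sketch below is purely illustrative, with a hypothetical send_command() transport and set top box address; the actual iMIRACLE/U-verse control protocol is not described on this page.

# Illustrative only: a hypothetical command set covering the TV playback
# controls listed above (play, stop, pause, mute, volume, full screen).
# The transport and the set top box address are placeholders.
from enum import Enum

class TvCommand(Enum):
    PLAY = "play"
    STOP = "stop"
    PAUSE = "pause"
    MUTE = "mute"
    VOLUME_UP = "volume_up"
    VOLUME_DOWN = "volume_down"
    TOGGLE_FULL_SCREEN = "toggle_full_screen"

def send_command(command: TvCommand, set_top_box_url: str) -> None:
    """Placeholder transport: a real client would issue this request to the
    Media Center PC or set top box that is driving the TV."""
    print(f"POST {set_top_box_url}/control body={{'cmd': '{command.value}'}}")

# Example: the user taps "Pause" in the TV controls view.
send_command(TvCommand.PAUSE, "http://settopbox.local")  # hypothetical address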
 

Architecture

A number of processing steps need to be performed before the user can speak a search query. Both the video content and the speech grammars need to be prepared, and these can be tuned according to the source content. The main components of the architecture are the content-based video search engine MIRACLE and the AT&T Speech Mashup Portal. The MIRACLE platform acquires the TV content and performs the media processing that extracts relevant content and indexes it. The AT&T Speech Mashup Portal handles the speech processing in the network cloud, using AT&T WATSON Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). Once the TV content is acquired, processed, and indexed, the iPhone/iPad (or any other client) can be used to search all the TV content in the Media Archive. New speech models are created daily to handle the constantly changing content: Content Descriptions are derived from the index and are used to build the HLM (Hierarchical Language Model) models. These HLM models make it possible to perform natural language understanding so that the different components of the search query can be teased apart in order to build the search query.
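As a concrete, purely illustrative example of what the NLU step produces, the search query from the interface example above can be teased apart into metadata constraints and content words. The slot names and search parameters below are assumptions for illustration; they are not the actual WATSON NLU output format or the MIRACLE query schema.

# Hypothetical mapping from NLU slots to MIRACLE search constraints.
from datetime import date, timedelta

# A plausible parse of "comedy shows on n b c past week that mention president obama"
nlu_parse = {
    "genre": "comedy",
    "channel": "NBC",
    "time_range": "past week",
    "content_words": "president obama",
}

def build_search_params(parse: dict) -> dict:
    """Turn NLU slots into metadata filters plus a content-word search."""
    params = {}
    if "genre" in parse:
        params["genre"] = parse["genre"]
    if "channel" in parse:
        params["channel"] = parse["channel"]
    if parse.get("time_range") == "past week":
        params["air_date_after"] = (date.today() - timedelta(days=7)).isoformat()
    if "content_words" in parse:
        params["transcript_query"] = parse["content_words"]
    return params

print(build_search_params(nlu_parse))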

The real-time interaction on the iPhone/iPad proceeds as follows. The user speaks a query on the mobile device via the speech app. The audio is sent to the Speech Mashup Portal, where AT&T WATSON performs the speech recognition and natural language understanding. The client speech app on the mobile device then sends the query to MIRACLE, and the list of TV programs that match the query is returned as shown above. The user then navigates the GUI to play back the video on the mobile device or, alternatively, on a properly connected TV.
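This round trip can be sketched roughly as follows. The endpoint URLs, request format, and response fields are placeholders, since the actual Speech Mashup and MIRACLE interfaces are not specified on this page.

# Rough sketch of the real-time interaction with placeholder URLs and fields.
# Assumes the Speech Mashup returns recognized text plus NLU slots, and that
# MIRACLE accepts those constraints and returns the matching programs.
import requests

SPEECH_MASHUP_URL = "https://speech.example.com/recognize"  # placeholder
MIRACLE_SEARCH_URL = "https://miracle.example.com/search"   # placeholder

def search_by_voice(audio_bytes: bytes) -> list:
    # 1. Send the captured audio to the Speech Mashup Portal for ASR + NLU.
    asr = requests.post(SPEECH_MASHUP_URL, data=audio_bytes).json()
    # asr is assumed to look like {"text": "...", "slots": {...}}

    # 2. Query MIRACLE with the parsed constraints.
    results = requests.get(MIRACLE_SEARCH_URL, params=asr["slots"]).json()

    # 3. The client GUI renders the returned program list; selecting a
    #    thumbnail starts playback on the device or on a connected TV.
    return results["programs"]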

iMIRACLE Architecture

Technical Documents

Appearance, Visual and Social Ensembles for Face Recognition in Personal Photo Collections
Eric Zavesky, Raghuraman Gopalan, Archana Sapkota
IEEE International Conference on Biometrics: Theory, Applications and Systems, 2013.


Multimedia (videos, demos, interviews)
Demonstration of the iMIRACLE speech-enabled content-based multimedia retrieval system (iMIRACLE-Video-Long)


Project Members

Bernard Renger

Michael Johnston

Zhu Liu

David Gibbon

Behzad Shahraray

Eric Zavesky

Related Projects

Project Space

AT&T Application Resource Optimizer (ARO) - For energy-efficient apps

Assistive Technology

CHI Scan (Computer Human Interaction Scan)

CoCITe – Coordinating Changes in Text

Connecting Your World

Darkstar

Daytona

E4SS - ECharts for SIP Servlets

Scalable Ad Hoc Wireless Geocast

AT&T 3D Lab

Graphviz System for Network Visualization

Information Visualization Research - Prototypes and Systems

Swift - Visualization of Communication Services at Scale

AT&T Natural Voices™ Text-to-Speech

Smart Grid

Speech Mashup

Speech translation

StratoSIP: SIP at a Very High Level

Telehealth

Content Augmenting Media (CAM)

Content-Based Copy Detection

Content Acquisition Processing, Monitoring, and Forensics for AT&T Services (CONSENT)

MIRACLE and the Content Analysis Engine (CAE)

Social TV - View and Contribute to Public Opinions about Your Content Live

Visual API - Visual Intelligence for your Applications

Enhanced Indexing and Representation with Vision-Based Biometrics

Visual Semantics for Intuitive Mid-Level Representations

eClips - Personalized Content Clip Retrieval and Delivery

AT&T WATSON (SM) Speech Technologies

Wireless Demand Forecasting, Network Capacity Analysis, and Performance Optimization