iMIRACLE uses large vocabulary speech recognition to recognize metadata words (titles, genre, channels, etc.) and content words that occur in recorded programs. It uses AT&T WATSON for speech recognition to recognize spoken queries and natural language understanding to parse the speech results in order to create the appropriate search query to the video search engine. It uses the content-based video search engine MIRACLE platform to find recorded programs that meet the search criteria. It allows users to play back the video on the iPhone/iPad or on a television (TV playback requires either a Microsoft Vista PC or a set top box similar to those deployed with AT&T U-verse). As new TV programs are processed and indexed by MIRACLE, new speech models are created daily to handle the constantly changing metadata and vocabulary.
iMIRACLE extends the speech-based control premise of earlier projects that controlled the TV with a speech-enabled remote control.
iMIRACLE is easy to use. You can use speech, text entry, or graphical input. You can search metadata such as title, genre, channel, and time, or you can search enhanced metadata, which lets you search for words within the content, or you can combine all of these. The beauty of using natural language speech is that it is easy to combine constraints in one query, and the natural language understanding capability effectively parses the different components of the query. Just press the "Speak" button, speak your query in natural language, and press the button again (now labeled "Stop") when you are done speaking. The text entry next to the button is updated automatically. iMIRACLE detects when you have stopped speaking, so pressing the "Stop" button is optional. If you are in a noisy environment, or one where speech input does not make sense, you can simply type the natural language query. You can even make typos, and the system will attempt to correct the spelling by returning the result that sounds most like what you typed. For example, typing "Prezident Obamma" will return "president obama".
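The page does not say which phonetic-matching technique iMIRACLE uses for typo correction, but the "Prezident Obamma" example behaves like classic sound-alike matching. As a rough illustration only, here is a simplified Soundex sketch (Soundex is one well-known phonetic code; iMIRACLE's actual matching may differ):

```python
def soundex(word: str) -> str:
    """Simplified Soundex sketch: sound-alike words map to the same 4-char code."""
    mapping = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return "0000"
    enc = []
    for c in letters:
        if c in "aeiouy":
            enc.append("*")              # vowels break duplicate-digit runs
        elif c in "hw":
            continue                     # h and w are transparent
        else:
            enc.append(c.translate(mapping))
    collapsed, prev = [], None
    for d in enc:                        # collapse adjacent duplicate digits
        if d != prev:
            collapsed.append(d)
        prev = d
    digits = [d for d in collapsed if d != "*"]
    if letters[0] not in "aeiouyhw":     # the first letter stands for its own digit
        digits = digits[1:]
    return (letters[0].upper() + "".join(digits) + "000")[:4]
```

Under this scheme, "Prezident" and "President" both encode to P623, and "Obamma" and "Obama" both encode to O150, so the misspelled query still retrieves the intended result.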
Below is an example interface showing the results of the speech search query "comedy shows on n b c past week that mention president obama". This text is the recognized text output by the speech recognizer. The user can navigate the list of shows that meet these criteria using the graphical interface. If the user navigates to the bottom of the page (by swiping the screen), NEXT and PREVIOUS buttons appear for retrieving the next 10 results.
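The NEXT/PREVIOUS paging behavior can be sketched as a simple offset calculation. The page size of 10 matches the description above; the function and parameter names are illustrative, not the real client code:

```python
def page_of_results(results, page_index, page_size=10):
    """Return one page of results plus whether NEXT/PREVIOUS should be shown.

    Illustrative sketch of the 10-at-a-time paging described above;
    names and signature are hypothetical, not the real client API.
    """
    start = page_index * page_size
    items = results[start:start + page_size]
    has_next = start + page_size < len(results)
    has_previous = page_index > 0
    return items, has_next, has_previous
```

With 25 matching shows, page 0 offers only NEXT, page 1 offers both buttons, and page 2 holds the final 5 results with only PREVIOUS.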
If the user selects one of the shows from the listing, the show's detail interface is displayed. Context text and representative thumbnails are shown. Just tap on a thumbnail and the video will start playing on the iPhone or iPad.
The user can also direct the video playback to a TV connected to a properly configured Media Center PC or a U-verse set top box. The video will then play on the TV when a thumbnail is selected. A link for TV controls appears, and the user can play, stop, pause, mute, increase or decrease the volume, and toggle between full-screen and reduced-screen views. If the query contained content words, playback on either the iPhone/iPad or the TV starts from the point where those words occur.
A number of processing steps need to be performed before the user can speak a search query: both the video content and the speech grammars need to be prepared, and these can be tuned according to the source content. The main components of the architecture are the content-based video search engine MIRACLE and the AT&T Speech Mashup Portal. The MIRACLE platform acquires the TV content, performs the media processing that extracts the relevant content, and indexes it. The AT&T Speech Mashup Portal handles the speech processing in the network cloud, using AT&T WATSON Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU).
Once the TV content is acquired, processed, and indexed, the iPhone/iPad (or any other client) can be used to search all the TV content acquired in the Media Archive. New speech models are created daily to handle the constantly changing content: Content Descriptions are derived from the index and are used to build Hierarchical Language Model (HLM) models. These HLM models make it possible to perform natural language understanding so that the different components of the search query can be teased apart in order to build the search query.
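As a rough illustration of the "teasing apart" step, the example query decomposes into metadata constraints and content words that map onto a structured search request. The slot and field names below are hypothetical; the real HLM output format and MIRACLE query syntax are not documented on this page:

```python
def build_search_query(slots):
    """Turn NLU slots into a structured search request (illustrative only).

    Slot names ("genre", "channel", ...) and the output fields are
    assumptions for this sketch, not the real MIRACLE API.
    """
    query = {}
    for field in ("title", "genre", "channel"):
        if field in slots:                       # metadata constraints
            query[field] = slots[field]
    if "time_range" in slots:                    # e.g. "past week" -> (start, end)
        query["after"], query["before"] = slots["time_range"]
    if "content_words" in slots:                 # matched against indexed program content
        query["content"] = " ".join(slots["content_words"])
    return query

# "comedy shows on n b c past week that mention president obama"
example = build_search_query({
    "genre": "comedy",
    "channel": "NBC",
    "content_words": ["president", "obama"],
})
```

Separating metadata fields from content words is what lets one spoken sentence combine a title/genre/channel filter with a full-text search over the recognized program content.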
The real-time interaction on the iPhone/iPad proceeds as follows. The user speaks a query on the mobile device via the speech app. The audio is sent to the Speech Mashup Portal, where AT&T WATSON performs the speech recognition and natural language understanding. The client speech app on the mobile device then makes the query to MIRACLE, and the list of TV programs that match the query is returned as shown above. The user navigates the GUI to play back the video on the mobile device or, alternatively, on a properly connected TV.
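That round trip can be sketched as a three-stage pipeline. The service calls are passed in as callables because the Speech Mashup and MIRACLE network APIs are not documented on this page; everything here is a stand-in, not the real client code:

```python
def search_by_voice(audio, recognize, parse, search):
    """ASR -> NLU -> MIRACLE pipeline sketch, with each service injected.

    `recognize`, `parse`, and `search` stand in for the Speech Mashup
    (AT&T WATSON ASR + NLU) and the MIRACLE search engine.
    """
    text = recognize(audio)    # Speech Mashup: ASR in the network cloud
    slots = parse(text)        # NLU: tease apart genre, channel, content words
    shows = search(slots)      # MIRACLE: match against the indexed Media Archive
    return text, shows
```

Injecting the services as callables keeps the client-side flow testable without any network access; the real app would call the Speech Mashup Portal and MIRACLE over HTTP instead.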
Appearance, Visual and Social Ensembles for Face Recognition in Personal Photo Collections
Eric Zavesky, Raghuraman Gopalan, Archana Sapkota
IEEE International Conference on Biometrics: Theory, Applications and Systems, 2013. [PDF] [BIB]
Multimedia (videos, demos, interviews)
Demonstration of the iMIRACLE speech-enabled content-based multimedia retrieval system (iMIRACLE-Video-Long)