Video - Indexing and Representation (Metadata)

Metadata as a Proxy for Representation and Indexing

Metadata is textual or numerical information that describes high-level properties of a piece of content. A few examples of metadata are a title, creation time, content duration, author, detected faces, etc. To be efficient and effective, a piece of metadata should generally consume fewer resources than the original data. For example, one could create metadata for a movie that describes each frame of that movie with ten words - but that would result in an astonishing 1,620,000 words total (10 words/frame x 30 frames/second x 60 seconds/minute x 90 minutes)! A more effective description might contain information about the actors, the length of the movie, or the locations of scenes in the movie.
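The arithmetic above can be checked in a few lines; the numbers (ten words per frame, 30 frames per second, a 90-minute movie) come straight from the example:

```python
# Back-of-the-envelope check of the frame-level metadata cost described above.
words_per_frame = 10
frames_per_second = 30
seconds_per_minute = 60
movie_minutes = 90

total_words = (words_per_frame * frames_per_second
               * seconds_per_minute * movie_minutes)
print(total_words)  # 1620000 words for a 90-minute movie
```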

In the context of multimedia and video content, metadata can have a wide variety of representations. Each representation creates another way that the content can be indexed (quickly accessed) by information retrieval systems, like databases. The list and illustration below provide a sample of some of the metadata representations that are created in the MIRACLE platform and are available for use in subsequent indexing, retrieval, and content consumption tasks.

  • Simple metadata provided with the video (title, date, description, air date, actor information, and hypertext links to related materials).
  • Textual content captured from subtitles, transcripts, and closed captions. These forms of textual content are often the most reliable because they have been manually created by editors and content providers.
  • Textual content automatically derived from speech (dialog and narration). Speech recognition is performed by the AT&T WATSON system with a large-vocabulary speech recognition model (or grammar). With the assistance of other textual sources, transcripts from speech recognition can help the Content Analysis Engine (CAE) automatically learn new words, such as unusual locations around the world or the latest buzzword in new technology.
  • Visual information computed with video analysis techniques that detect changes in the scene (a fade, cut, dissolve, etc.) and perform face clustering to find recurring characters or actors in a video.
  • Speaker segmentation information allowing differentiation among speakers. Speaker segmentation helps to identify the dialog of different people, like the president and reporters in a press conference. Segmentation also facilitates other automatic processes such as summarization and speaker recognition (the automatic association of a face with a voice).
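The representations in the list above could be collected into a single per-video record. The sketch below is purely illustrative - the field names and types are assumptions, not the MIRACLE platform's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types; names are illustrative, not the MIRACLE schema.
@dataclass
class SpeakerSegment:
    speaker_id: str       # label from speaker segmentation
    start_sec: float
    end_sec: float

@dataclass
class VideoMetadata:
    title: str                                                  # simple metadata
    air_date: str
    closed_captions: List[str] = field(default_factory=list)    # editor-provided text
    asr_transcript: List[str] = field(default_factory=list)     # speech recognition output
    shot_boundaries: List[float] = field(default_factory=list)  # cut/fade/dissolve times (sec)
    face_clusters: List[str] = field(default_factory=list)      # recurring-character labels
    speaker_segments: List[SpeakerSegment] = field(default_factory=list)

record = VideoMetadata(title="Evening News", air_date="2004-06-01")
record.speaker_segments.append(SpeakerSegment("speaker_1", 0.0, 12.5))
print(record.title, len(record.speaker_segments))
```

Each field corresponds to one of the metadata representations listed above, so an indexing system can build a separate index over each one.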


Real-time Multimedia Analysis

RTMM, or Real-time MultiMedia Analysis, is the application of several components of the Content Analysis Engine in a real-time fashion. That means metadata for video segments, speech recognition, detected faces, summarized keywords, and even visual concepts can be produced on-the-fly for just about any stream. Because any technology that can stream and capture content can feed a processing instance, the RTMM system was designed for live or near-live analysis of multimedia content.
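The key property described above is that metadata is emitted per chunk as the stream arrives, rather than after the whole program has been captured. A minimal sketch of that idea (the analyzer is a stand-in for the real CAE components, not their actual interface):

```python
# Minimal sketch of a real-time analysis loop: each incoming chunk of a stream
# passes through a hypothetical analyzer, and the resulting metadata is yielded
# immediately instead of after the whole program ends.
def analyze_chunk(chunk):
    # Stand-in for the real analyzers (speech recognition, face detection, ...):
    # here we just take the first three words as "keywords".
    return {"offset": chunk["offset"], "keywords": chunk["text"].split()[:3]}

def rtmm_stream(chunks):
    for chunk in chunks:
        yield analyze_chunk(chunk)  # metadata produced on-the-fly, per chunk

stream = [{"offset": 0.0, "text": "breaking news from the capital today"},
          {"offset": 5.0, "text": "weather forecast calls for rain"}]
for meta in rtmm_stream(stream):
    print(meta["offset"], meta["keywords"])
```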


When used as part of a larger framework, the RTMM system produces metadata that can be used in a number of powerful systems. For example, if a user wanted to receive alerts with the relevant video clip about a specific headline, the RTMM could be used to create the appropriate content playlist and trigger eClips. In another scenario, if the RTMM is incorporated in a content creation stage at a service provider, it could create metadata streams for several content channels and send those to all users for their own personalized alerts. A prototype of this system was created as a service in the Content Augmenting Media (CAM) project, which not only offers an "alerting service" for current TV content, but also creates an improved EPG (electronic program guide) by providing information from the live content itself. Other projects tailored to mobile devices, summarization engines, and content recommendation could also utilize the real-time metadata streams generated by the RTMM.
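The alerting scenario above amounts to matching each user's subscribed terms against the keywords in the per-clip metadata stream. A hedged sketch of that matching step, with made-up clip and user names:

```python
# Illustrative alert matching: compare a user's subscribed terms against the
# keywords in each clip's metadata and emit (user, clip) pairs on overlap.
# Names and structures are assumptions, not the CAM project's actual API.
def match_alerts(metadata_stream, subscriptions):
    alerts = []
    for meta in metadata_stream:
        words = set(meta["keywords"])
        for user, terms in subscriptions.items():
            if words & set(terms):  # any shared keyword triggers an alert
                alerts.append((user, meta["clip_id"]))
    return alerts

stream = [{"clip_id": "clip-1", "keywords": ["election", "results"]},
          {"clip_id": "clip-2", "keywords": ["weather", "rain"]}]
subs = {"alice": ["election"], "bob": ["sports"]}
print(match_alerts(stream, subs))  # [('alice', 'clip-1')]
```

A production service would of course use richer matching (stemming, synonyms, relevance scores), but the data flow - metadata stream in, personalized alerts out - is the same.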

Unsupervised Segmentation and Classification

More information coming, thanks for your patience!

Innovations in Standards and Protocol Definitions


The Alliance for Telecommunications Industry Solutions (ATIS) develops standards for a broad range of communications applications. The ATIS IPTV Interoperability Forum (IIF) is a subgroup focused on advanced television services delivered over managed networks to connected TVs, set-top boxes, and mobile devices.

The scope of the work includes delivery of HD and 3D live TV programming over multicast IP transport, targeted advertising, video and other content on demand, and DVR capabilities. Rigorous content security protocols and detailed quality of service metrics are defined and the services support broadcast requirements for accessibility and emergency alerting.

Data models are defined for content description, program guides, user preferences, etc., and are represented in XML schemas to ensure interoperability. These schemas are harmonized with existing industry standards such as OMA BCAST and MPEG-7. All forms of the Content Analysis Engine support MPEG-7 representation of extracted metadata, enabling advanced video services in a standards-compliant manner.