Video - Analytics and Indexing

Metadata as a Proxy for Analytics and Indexing

Metadata is textual or numerical information that describes high-level properties of a piece of content. A few examples of metadata are a title, creation time, content duration, author, detected faces, etc. To be efficient and effective, a piece of metadata should generally consume fewer resources than the original data. For example, one could create metadata for a movie that describes each frame of that movie with ten words, but that would result in an astonishing 1,620,000 total words (10 words/frame x 30 frames/second x 60 seconds/minute x 90 minutes)! A more effective description might contain information about the actors, the length of the movie, or the locations of scenes in the movie.
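The word count above is simple arithmetic; a quick sketch makes the back-of-the-envelope cost explicit (all figures are the ones given in the text):

```python
# Back-of-the-envelope check of the frame-by-frame description cost:
# 10 words per frame, 30 frames/second, for a 90-minute movie.
words_per_frame = 10
frames_per_second = 30
seconds_per_minute = 60
movie_minutes = 90

total_words = (words_per_frame * frames_per_second
               * seconds_per_minute * movie_minutes)
print(total_words)  # 1620000
```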

In the context of multimedia and video content, metadata can have a wide variety of representations. Each representation creates another way that the content can be indexed (quickly accessed) by information retrieval systems, like databases. The list and illustration below provide a sample of some of the metadata representations that are created in the MIRACLE platform and are available for use in subsequent indexing, retrieval and content consumption tasks.

  • Simple metadata provided with the video (title, date, description, air date, actor information, and hypertext links to related materials).
  • Textual content captured from subtitles, transcripts, and closed captions. These forms of textual content are often the most reliable because they have been manually created by editors and content providers.
  • Textual content automatically derived from speech (dialog and narration). Speech recognition is performed by the AT&T WATSON system with a large-vocabulary speech recognition model (or grammar). With the assistance of other textual sources, transcripts from speech recognition can help the CAE automatically learn new words such as unusual locations around the world or the latest buzzword in new technology.
  • Visual information computed with video analysis techniques that detect changes in the scene (a fade, cut, dissolve, etc.) and perform face clustering to find recurring characters or actors in a video.
  • Speaker segmentation information allowing differentiation among speakers. Speaker segmentation helps to identify the dialog of different people, like the president and reporters in a press conference. Segmentation also facilitates other automatic processes such as summarization and speaker recognition (the automatic association of a face with a voice).
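The representations above can be thought of as fields of a single index record per video. A minimal sketch follows; the field names are illustrative placeholders, not the actual MIRACLE schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative index record combining the metadata types listed above.
# All field names are hypothetical, not the real MIRACLE schema.
@dataclass
class VideoMetadata:
    title: str                        # simple metadata supplied with the video
    air_date: str
    captions: List[str]               # editor-provided subtitles / closed captions
    asr_transcript: List[str]         # automatic speech-recognition output
    shot_boundaries: List[float]      # scene-change times in seconds (cut/fade/dissolve)
    face_clusters: List[List[float]]  # per-character lists of appearance times
    speaker_turns: List[Tuple[float, float, str]]  # (start, end, speaker id)

record = VideoMetadata(
    title="Evening News",
    air_date="2013-05-01",
    captions=["Good evening."],
    asr_transcript=["good evening"],
    shot_boundaries=[0.0, 12.4],
    face_clusters=[[0.5, 13.0]],
    speaker_turns=[(0.0, 12.4, "anchor")],
)
print(record.title)  # Evening News
```

Each field corresponds to one bullet above, so an information retrieval system can build a separate index over each representation.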


Content Analysis

RTMM (Real-time MultiMedia Analysis) is the application of several components of the Content Analysis Engine in a real-time fashion. Metadata for video segments, speech recognition, detected faces, summarized keywords, and even visual concepts can be produced on-the-fly for just about any stream. The RTMM system was intended for live or near-live analysis of multimedia content, allowing any technology to stream and capture content for a processing instance.
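The on-the-fly behavior can be sketched as a generator that emits metadata for each segment as soon as the segment completes, rather than waiting for the whole asset. The stream source, segment size, and metadata fields below are illustrative assumptions, not the RTMM interfaces:

```python
from typing import Dict, Iterable, Iterator

def analyze_stream(frames: Iterable[bytes], segment_len: int = 3) -> Iterator[Dict]:
    """Group incoming frames into fixed-size segments and emit metadata
    for each segment as soon as it completes (no full-asset buffering)."""
    buf = []
    for i, frame in enumerate(frames):
        buf.append(frame)
        if len(buf) == segment_len:
            yield {"segment_start": i - segment_len + 1,
                   "num_frames": len(buf)}
            buf = []

# Seven frames at segment_len=3 yield two complete segments;
# the partial one-frame tail is not emitted.
events = list(analyze_stream([b"f"] * 7))
print(events)
```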

A complementary analysis system produces metadata at a granular, "event" level instead of data for an entire asset. This system, broadly referred to as Content Analytics, was created with modularity for both small-footprint platforms (e.g., a low-power device next to a capture source) and cloud-based compute resources, and can be run in batch and streaming modes. The Content Analytics platform also allows the user to programmatically determine what types of metadata and processing modules to instantiate to better optimize runtime performance.
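Programmatic module selection might look like the following sketch, where only the requested modules are instantiated so a small-footprint device pays only for what it needs. The registry, module names, and return shapes are hypothetical, not the actual Content Analytics interfaces:

```python
from typing import Callable, Dict, List

# Hypothetical registry of processing modules (stubs for illustration).
REGISTRY: Dict[str, Callable[[bytes], dict]] = {
    "shot_detect": lambda frame: {"shot_boundary": False},
    "face_detect": lambda frame: {"faces": []},
    "keywords":    lambda frame: {"keywords": []},
}

def build_pipeline(enabled: List[str]) -> Callable[[bytes], dict]:
    """Instantiate only the requested modules to trim runtime cost."""
    modules = [REGISTRY[name] for name in enabled]
    def run(frame: bytes) -> dict:
        result: dict = {}
        for module in modules:
            result.update(module(frame))
        return result
    return run

# A low-power capture device might enable only shot detection:
pipeline = build_pipeline(["shot_detect"])
print(pipeline(b""))  # {'shot_boundary': False}
```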


When used as part of a larger framework, the RTMM system produces metadata that can be used in a number of powerful systems. For example, if a user wanted to receive alerts with the relevant video clip about a specific headline, the RTMM could be used to create the appropriate content playlist and trigger eClips. In another scenario, if the RTMM is incorporated in a content creation stage at a service provider, it could create metadata streams for several content channels and send those to all users for their own personalized alerts. A prototype of this system was created as a service in the Content Augmenting Media (CAM) project, which not only offers an "alerting service" for current TV content, but also creates an improved EPG (electronic program guide) by providing information from the live content itself. Other projects tailored to mobile devices, summarization engines, and content recommendation could also utilize the real-time metadata streams generated by the RTMM.

Unsupervised Segmentation and Classification

BBC Rushes Diagram

As the amount and diversity of content continues to grow, intelligent segmentation of video is required to understand the content and semantics of a content segment. Additionally, with the wide adoption of social content sharing sites like YouTube, Vine, and Vimeo, short-form or "snackable" content segments are popular for remixing, fast sharing, and expressing ideas quickly.

Harnessing state-of-the-art methods developed in the MIRACLE and CAE platforms, content can be partitioned into small "shots" as illustrated to the right. These shots are generally consistent in content (the same scene, often a fixed camera, etc.) so they are logically ideal for subsequent semantic classification, object detection, and image search and copy detection. Finally, by comparing the structure and repeatability of the pieces themselves, it may be possible to achieve resource savings with content summarization, which can benefit the end user by reducing non-relevant content and saving bandwidth in transmitting the content.
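A common baseline for this kind of shot partitioning is to flag a boundary when the grey-level histogram changes sharply between consecutive frames. The sketch below uses tiny synthetic frames and an arbitrary threshold; production systems (including the CAE) rely on far richer features and handle gradual transitions like fades and dissolves:

```python
from typing import List, Sequence

def histogram(frame: Sequence[int], bins: int = 4) -> List[float]:
    """Normalised histogram of 0-255 grey-level pixel values."""
    counts = [0] * bins
    for px in frame:
        counts[px * bins // 256] += 1
    return [c / len(frame) for c in counts]

def find_cuts(frames: List[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return indices of frames that start a new shot, using the L1
    distance between consecutive frame histograms."""
    cuts = []
    for i in range(1, len(frames)):
        h_prev, h_curr = histogram(frames[i - 1]), histogram(frames[i])
        dist = sum(abs(a - b) for a, b in zip(h_prev, h_curr))
        if dist > threshold:
            cuts.append(i)
    return cuts

dark = [10] * 64     # synthetic all-dark frame
bright = [240] * 64  # synthetic all-bright frame
print(find_cuts([dark, dark, bright, bright]))  # [2]
```

The frames on either side of a detected boundary group naturally into the consistent shots described above, ready for the downstream classification and search tasks.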


Content segmentation is an important part of almost any content-related application, from making home videos more interesting (or linking similar family videos together) to making a succinct video to share via a mobile device. Looking at large collections of personal content, complete photo and video sharing applications like VidCat use content segmentation to create a more streamlined and enjoyable user experience.